Can we collect too much data?

Scientists regularly call for "more data" to better establish experimental results. But can we collect too much? Recent studies in personal genomics and particle physics illustrate the benefits and costs of collecting it all.
Written by Hannah Waters, Weekend Editor

In just about every article about scientific research, you'll see this line: "The results are promising, but I'd like to see more data." But does more data necessarily mean more information?

Take the Large Hadron Collider (LHC), for example. The enormous physics experiment produces roughly 15 petabytes (15 million gigabytes) of data annually, which is "enough to fill more than 1.7 million dual-layer DVDs a year," according to CERN. And while that seems like far more than any team of humans can handle, some physicists think even more data should be collected and analyzed.
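
For a rough sense of that scale, the arithmetic behind CERN's comparison can be sketched in a few lines of Python. This is only a back-of-the-envelope check: it assumes decimal units (1 petabyte = 1 million gigabytes, as in the figure above) and a nominal capacity of 8.5 GB for a dual-layer DVD. The capacity figure and the names used here are illustrative and are not taken from CERN.

```python
# Back-of-the-envelope check of the "1.7 million DVDs" comparison.
GB_PER_PETABYTE = 1_000_000   # decimal (SI) units, matching the article's figure
DVD_DUAL_LAYER_GB = 8.5       # assumption: nominal dual-layer DVD capacity


def dvds_needed(petabytes, dvd_gb=DVD_DUAL_LAYER_GB):
    """Return how many discs of size dvd_gb hold the given number of petabytes."""
    return petabytes * GB_PER_PETABYTE / dvd_gb


annual_gb = 15 * GB_PER_PETABYTE
print(f"{annual_gb:,} GB per year")
print(f"about {dvds_needed(15) / 1e6:.2f} million dual-layer DVDs")
```

Under those assumptions the result lands between 1.7 and 1.8 million discs, which is consistent with CERN's "more than 1.7 million."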

Right now the experiment's focus is on finding the Higgs boson, and the LHC's computers are set to automatically collect only those data that suggest its existence based on physicists' theories. But at a meeting in Italy last week, one physicist made the point that the collider should be doing more. "It could be the situation a year from now that nothing will be found at the LHC other than the Higgs," Tomer Volansky of Tel Aviv University in Israel told New Scientist.

Volansky says we should look for signs of more exotic - and unlikely - physics, such as a new force beyond the four we already know. "We should drop our prejudice and look for anything that is possible," he says. "If we won't check, we won't know."

But other researchers argued that storing all that data is "impractical." Even now, the LHC is only able to store what it collects because of a grid of 170 computer centers in 34 countries that share the load between them. And even if the data is in hand, how would the researchers search through it?

The fields of biology and medicine are facing similar problems. Researchers collect ever more genomic data, but little information is gleaned from the 150 gigabytes of a single human genome, and comparative genomic studies have not lived up to the hype. And that's not counting lab research, where each experiment is replicated and often investigates the biological molecules associated with DNA in more detail.

However, not all is doomed. Last week, geneticist Michael Snyder published 2.5 years of his own personal data, including his genomic sequence, and RNA, protein, metabolic and auto-antibody profiles taken 20 times over a 14-month period. And by analyzing the data on a timeline, he discovered that he was at risk for type 2 diabetes, despite having no family history.

Snyder is just one person; could his success in identifying disease by collecting immense amounts of biological data be, in itself, a fluke of data? As Nature reported:

“A criticism of this paper is that it’s anecdotally about one person, but that’s also its strength,” says geneticist George Church of Harvard Medical School in Boston, Massachusetts. Large-scale association studies trying to tease out the genetic variants of complex disease often make the mistake of trying to achieve statistical significance by “lumping together an enormous number of people that don’t necessarily belong together”, says Church.

Snyder's story illustrates that good can come out of collecting large amounts of data, but the LHC is a warning that there can be too much, both to store and to glean real information from.

The main reason, in my mind, for collecting all the data is to aid future researchers: as long as you're doing a high-cost experiment, why not try to trap as much potential information as possible? But, as with many things, the up-front investment may not seem worth it at the start, when the long-term gain is still uncertain.

Photo: Flickr/Kenny Louie

This post was originally published on Smartplanet.com
