LHC, Genomics and the era of massive data sets

Written by Mitch Ratcliffe, Contributor

The world won't end when the Large Hadron Collider makes its anticipated first collision of high-energy particles later this month, but it will produce an explosion of data. The LHC will produce as much data in a single day of experimentation as many scientists dealt with in a lifetime only a decade ago. Genomics research, from protein folding to RNA transcription, generates massive data sets that, if widely analyzed, could produce fantastic discoveries at a pace that puts industrial-era innovation to shame. A great read on the Petabyte Era in science is Cory Doctorow's "Welcome to the Petacentre": it's not every day you get to read about dozens of robot librarians running hundreds of 10PB libraries, and it's pretty cool.

Data is the key to discovery. The rapid sharing of data is the accelerator of innovation. But the scientific community and industry have long tried to keep data sets closed. For example, the journal Nature reports that an Italian team looking into the existence of dark matter has held its data so closely that photographs of a conference slide describing the results were used to smuggle them into publication ahead of the team's own paper. For many scientific journals, data must be protected prior to first publication.

We need to find new ways to ensure that scientific credit is preserved even as data is shared, and that economic opportunity is distributed widely enough for entrepreneurial scientists and companies to benefit from their contributions. The advent of wiki usage in scientific communities is already producing unexpected benefits, from faster analysis of data to novel interpretations and discoveries, according to Nature's Big Data issue, which is freely accessible on the Net.

Keeping data secret produces fabulous wealth in some cases; in others, it delivers untested technologies and drugs, among other things, to an unsuspecting public. Testing of theories and findings is at the core of the scientific process, something the pursuit of secrecy actually impairs. And because product safety has been discounted so dramatically during the past eight years, it is time to recognize that the public needs access to all the data describing any product, so that people can organize new approaches to testing the claims behind product marketing.

Without standards, all that data is a mishmash that adds complexity without value. Without access, all that data is like unmined iron that will never become tomorrow's steel beams and infrastructure in the knowledge economy. The former is slowly being dealt with; the latter is slowly being undermined by network operators and the culture of secrecy.

The trend toward capping Internet usage based on misleading pricing schemes, such as Comcast's 250GB monthly limit on data transfers over its network, erects barriers to participation in the knowledge economy. A single day's data output from the LHC or a protein-folding data set can exceed 250GB, and distributed computing projects such as Folding@home have already shown the benefit of applying spare bandwidth and compute cycles to complex problems.
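
To make the scale concrete, here is a back-of-envelope sketch in Python. The annual LHC output used here is an assumption based on the roughly 15 petabytes per year widely cited around the collider's startup, not an official figure; the cap is Comcast's, as above.

```python
# Back-of-envelope arithmetic: how a residential data cap compares with
# petabyte-scale science output. The LHC figure is an assumed estimate
# (~15 PB/year), not an official specification.
GB_PER_PB = 1_000_000

lhc_yearly_pb = 15                               # assumed annual LHC output
lhc_daily_gb = lhc_yearly_pb * GB_PER_PB / 365   # roughly 41,000 GB per day
cap_gb = 250                                     # Comcast's monthly cap

print(f"Approximate LHC output per day: {lhc_daily_gb:,.0f} GB")
print(f"Monthly caps consumed by one day's output: {lhc_daily_gb / cap_gb:,.0f}")
```

On those assumed numbers, a single day at the LHC would burn through more than 160 capped household connections, which is the sense in which such limits wall ordinary users off from petabyte-scale science.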

But more problematic is the tendency to keep data sets secret. What many people don't realize when they say "Of course, companies need to keep data secret" is that the same data can be applied to many problems, most of which the company that owns the data isn't even pursuing. A statistician, for instance, can use large data sets from several different sources to test probabilistic formulas applicable to economics without encroaching on the research for which the data was originally collected. There are also many cases where drug-testing data contains pointers to unrecognized receptors in the human body, discoveries from which both the originator of the data and the discoverer of the novel receptor can benefit.
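
As a purely hypothetical illustration of that kind of reuse, the Python sketch below runs two unrelated analyses against a single shared data set; the data, the questions, and every number in it are invented.

```python
# Hypothetical illustration of data reuse: one shared data set, two
# unrelated questions. All values here are simulated, not real trial data.
import random
import statistics

random.seed(42)
# Stand-in for a published trial data set: per-patient response measurements.
shared_trial_data = [random.gauss(5.0, 1.5) for _ in range(10_000)]

# Analysis 1: the originator's question -- average treatment response.
print(f"Mean response: {statistics.mean(shared_trial_data):.2f}")

# Analysis 2: an outsider's question the originator never asked -- does the
# spread of responses match a probabilistic model that predicts sigma = 1.5?
print(f"Observed std dev: {statistics.stdev(shared_trial_data):.2f} (model predicts 1.50)")
```

The point is not the statistics themselves but that the second question costs the originator nothing once the data is shared.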

If we don't open this scientific boon to everyone, though, by making it easy and cheap to share data, as the Web was designed to do, the benefits of the Petabyte Era will be limited, lopsided, and concentrated in a few centers of research rather than distributed throughout the economy.
