LHC, Genomics and the era of massive data sets

The world won't end when the Large Hadron Collider makes its anticipated first collision of high-energy particles later this month, but it will produce an explosion of data. The LHC will produce as much data in a single day of experimentation as many scientists dealt with in a lifetime only a decade ago. Genomics research, from protein folding to RNA transcription, generates massive data sets that, if widely analyzed, could produce fantastic discoveries at a pace that puts industrial-era innovation to shame. A great read on the Petabyte Era in science is Cory Doctorow's "Welcome to the Petacentre." You don't get to read about dozens of robot librarians running hundreds of 10PB libraries every day, and it's pretty cool.

Data is the key to discovery. The rapid sharing of data is the accelerator of innovation. But the scientific community and industry have long tried to keep data sets closed. For example, the journal Nature reports that an Italian team looking into the existence of dark matter has held its data so closely that other researchers used digital cameras to photograph a slide describing the data at a conference, then worked it into publication ahead of the team's own research. For many scientific journals, data must be protected prior to first publication.

We need to find new ways to ensure scientific credit is preserved even as data is shared, and that economic opportunity is distributed widely so that entrepreneurial scientists and companies can benefit from their contributions. The advent of wiki usage in scientific communities is already producing unexpected benefits, from faster analysis of data to novel interpretations and discoveries, according to Nature's Big Data Issue, which is freely accessible on the Net.

The result of keeping data secret is fabulous wealth in some cases, but in others it delivers untested technologies and drugs, among other things, to an unsuspecting public. Testing of theories and findings is the core of the scientific process, something that the pursuit of secrecy actually impairs. And because product safety has been discounted so dramatically during the past eight years, it is time to recognize that the public needs access to all the data describing any product so that they can organize new approaches to testing the claims that back product marketing.

Without standards, all that data is a mishmash that adds complexity without value. Without access, all that data is like unmined iron that will never become tomorrow's steel beams and infrastructure in the knowledge economy. The former problem is slowly being dealt with; the latter is slowly being made worse by network operators and the culture of secrecy.

The trend toward capping Internet usage based on misleading pricing schemes, such as Comcast's 250GB monthly limit on data transfers over its network, erects barriers to participation in the knowledge economy. A single day's data output from the LHC or a protein-folding data set can exceed 250GB, and we've seen the benefits in distributed computing projects, such as "protein folding at home," in which spare bandwidth and compute cycles can be applied to solving complex problems.
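To put the cap in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1GB = 10^9 bytes, 1PB = 10^15 bytes) and a participant who uses the entire 250GB allowance for nothing else; the function name is purely illustrative. It takes the 10PB library size mentioned above as the data set to be moved.

```python
GB = 10**9   # assumption: decimal gigabyte
PB = 10**15  # assumption: decimal petabyte


def months_under_cap(dataset_bytes: int, cap_bytes_per_month: int) -> int:
    """Whole billing months needed to move a data set without exceeding the cap."""
    # Ceiling division on integers avoids floating-point rounding.
    return -(-dataset_bytes // cap_bytes_per_month)


cap = 250 * GB
print(months_under_cap(10 * PB, cap))  # 40000 months, or more than 3,300 years
```

Even a data set only slightly larger than the cap spills into a second billing month, which is the practical barrier for anyone who wants to work with this kind of data from home.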

But more problematic is the tendency to keep data sets secret. What many people don't realize when they say "Of course, companies need to keep data secret," is that the same data can be applied to many problems, most of which the company that owns the data isn't even trying to solve. For instance, a statistician can use large data sets from several different sources to test probabilistic formulas that can be applied in economics without encroaching on the research for which the data was originally intended. But there are also many cases where drug testing data includes pointers to unrecognized receptors in the human body, a finding from which both the originator of the data and the discoverer of the novel receptor can benefit.

If we don't open this scientific boon to everyone, though, by making it easy and cheap to share data (the way the Web was designed to do), the benefits of the Petabyte Era are going to be limited, lopsided and concentrated in a few centers of research rather than distributed throughout the economy.

  • Definitely good for my business

    I'm quite sure that my employer will be more than happy to sell people tools for analyzing those massive datasets. In practice, secrecy (or rather, confidentiality) has its place (don't really want the general public poring over personal financial records, even though financial data are very useful to analyze), but not in government-funded scientific research. Taxpayers (rightly or wrongly) fund most of big science, so the results should be available to the general public for no more than the cost of publication, and be freely copyable and publishable by whoever wants to.
    John L. Ries
    • Confidentiality applies to personal data

      I don't think our lives should be thrown open to analysis --
      quite the opposite, but when it comes to creating the
      greatest possible economic benefit from research I think
      we're on the same page. Open access yields huge returns.
      Mitch Ratcliffe
      • The idea is not to analyze individuals...

        ...but groups in an effort to understand economic behavior (my boss is an econometrician). Individual records must still be kept in confidence (and are not really of interest to the analyst, except when taken together), but financial institutions (and other sorts of businesses) do analyze their customer records (often linked with publicly available data), build statistical models from them, and make decisions based on those models. How valid these models are and how appropriately they're used are, of course, highly variable.

        There is, of course, no need to keep particle reaction data similarly confidential.
        John L. Ries
  • RE: LHC, Genomics and the era of massive data sets

    I hope they develop some sort of Folding@Home client to help with data from the LHC!
  • RE: LHC, Genomics and the era of massive data sets

    Thanks for bringing the issue of scientific progress into the debate on Deep Packet Inspection Services on the Internet, a subject at the core of the Telecoms Package review in the European Parliament. The vote is coming up soon, but unfortunately the link between scientific progress and distributed sharing of large sets of data has not been examined to understand whether, if at all, business models based on filtering should be mandated.
  • RE: LHC, Genomics and the era of massive data sets

    Why don't you beg for Microsoft open data?...
    • Maybe...

      ...because MS is a private corporation, rather than a publicly-funded research institution. If the public pays for the research, they should have access to the research, unless there's a very good reason to deny it (military secrets would fall into that category).
      John L. Ries
  • RE: LHC, Genomics and the era of massive data sets

    There's an inherent problem with the era of massive data sets that is addressed in certain quiet corners of 'Intelligent Design' structures... namely, in how they are constructed and sustained. That is why Intelligent Design is also called 'Complexity Theorem' (placing the simplicity of creationism aside for the moment). In complexity, structures build to a certain level of sustainable structures that then require even more complicated levels within ever-growing demands for more and even more integrated systems for structural function. It is within these complex and continuous constructs that Intelligent Design draws upon its top-down conclusions, as each structure will continue to find ever more ingenious ways to create more sophisticated and interdependent structures. The only problem is found with the systemic nature of the landscape, which will allow for only a certain fitness, after which demands create stress and ultimately systemic failure. Then there is a collapse... Not by accident but rather by design. This is not a problem with 'Complexity Theorem' but rather a component. However, since a structure of Intelligent Design cannot understand something more complicated than itself, it cannot determine the nature of its failure until it actually happens. This is how it proceeds to the next level of complexity. It looks as if we may not only have a problem with massive data but also useless data. In the current banking crisis some say it's a problem of liquidity... Complexity would address this as the byproduct of useless functions that produce toxic securities, which should be allowed to fail if the system is to survive at an optimum level. Throwing infinite money or any amount of massive data at it simply will not work. The system is self-correcting, as in the 'invisible hand' of Adam Smith... The solution then becomes self-evident. That is why Complexity Theory is often referred to as 'Self Organizing.'