CERN steps up to massive storage challenge

A hotbed of scientific endeavour, CERN deals with petabytes of data. To cope with the deluge, it has turned to a combination of commodity x86 systems and Linux.
Written by Tom Espiner, Contributor

Researchers at CERN, the world's largest particle physics laboratory, face a truly immense storage challenge.

One of its latest projects, the Large Hadron Collider (LHC), is being built to study particles and the forces that bind them together. Due to become fully operational around September 2007, the LHC will fire billions of protons round a 27km circuit, 150m below ground.

Each beam carries 3,000 bundles of 100 billion protons. Superconducting magnets, supercooled to -271°C, bend the protons' paths round the circuit, and the beams are made to collide at the centre of four detectors in the tunnel. The interactions between the protons are measured there at a rate of 40 million events per second.

In short, this means that CERN's scientists have an awful lot of data on their hands. They use computers to filter the events down to a few hundred "good" events per second, but even this can generate between 100 and 1,000 megabytes of data per second.

That equates to 15 petabytes of data per year for four experiments, which will be stored on magnetic tape and disk.
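As a rough check on how those figures fit together, the sketch below converts a sustained data rate into a yearly volume and back again. It assumes decimal units (one petabyte as a billion megabytes) and round-the-clock recording for a full year, neither of which the article specifies; the function names and the duty_cycle parameter are illustrative only. Under those assumptions, 100 to 1,000 megabytes per second works out at roughly 3 to 32 petabytes per year, and 15 petabytes per year corresponds to an average of about 476 megabytes per second.

```python
# Back-of-the-envelope conversion between data rate and yearly volume.
# Assumptions (not from the article): decimal units and continuous running.

SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 seconds
MB_PER_PB = 10**9                    # 1 PB = 1,000,000,000 MB (decimal)


def petabytes_per_year(mb_per_second, duty_cycle=1.0):
    """Yearly volume in PB for a sustained rate in MB/s.

    duty_cycle is the assumed fraction of the year spent recording data.
    """
    return mb_per_second * SECONDS_PER_YEAR * duty_cycle / MB_PER_PB


def average_rate_for(pb_per_year):
    """Average rate in MB/s needed to accumulate pb_per_year in one year."""
    return pb_per_year * MB_PER_PB / SECONDS_PER_YEAR


if __name__ == "__main__":
    for rate in (100, 1000):
        print(f"{rate} MB/s -> {petabytes_per_year(rate):.1f} PB per year")
    print(f"15 PB per year -> {average_rate_for(15):.0f} MB/s on average")
```

A collider does not record data every second of the year, so lowering duty_cycle gives a sense of how the quoted rates and the yearly total depend on actual running time.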

"This is far too large for a single datacentre," said Dr Helge Meinhard, technical coordinator for CERN-IT Switzerland. "The information is federated to more than 120 datacentres worldwide."

The processing power currently required by CERN is equivalent to 30,000 CPU servers, Meinhard told ZDNet UK, speaking at the Storage Networking World event in Frankfurt.

Experimental event data is sent via optical links to CERN computer centres. One data stream is stored on magnetic tape, a second is sent to one or two of CERN's 11 "Tier 1" centres, and a third is sent to CERN's CPUs for analysis and to map the particle events.

This network is dubbed the DataGRID, and CERN's scientists will be able to access data from anywhere on the network.

Storage is made more complex by each centre being autonomous, although there are commonalities. All the centres use the x86 architecture and Linux. CERN itself uses x86 and Linux on 98 percent of its systems, according to Meinhard.

"The main reason is cost," said Meinhard. "It gives us the best value for money. You don't have to pay per machine, which is a significant advantage."

Another CERN scientist, who preferred not to be named, said that it wouldn't be possible to fund CERN projects if they had to rely on proprietary software, because of the cost of licensing.

CERN physicists also keep costs down by developing their own "homemade" software, and relying on commodity or off-the-shelf equipment as far as possible.

With collisions due to begin in earnest in the LHC by late summer 2007, the physicists hope to find the Higgs boson, a hypothetical elementary particle.

American scientists working on the LHC project got a boost last week when two high-speed networks, ESNet and Internet2, announced they would work together to develop a "highly reliable, high-capacity network" across the US.
