Inside Facebook's lab: A mission to make hardware open source

Summary: A look behind the scenes of Facebook's hardware lab, the spiritual home of the Open Compute datacentre hardware movement, which may radically change the type of IT enterprises use, and who they buy it from.

Amir Michael, Facebook's manager of system engineering, stands in the company's hardware lab, trying not to get in the way of the assorted engineers, wheelie-chairs and bottles of water scattered around the room, as he describes Facebook's attempt to democratise hardware.

"We're trying to take away a lot of the uniqueness of server design" to create a "clean, open canvas" for companies to base their datacentres around, he explains.

Michael is talking about the server and storage systems that Facebook uses in its datacentres. By publishing the designs and specifications of this low-power, low-cost hardware, the social-networking leader hopes to reduce the cost of infrastructure for businesses large and small.

In Facebook's lab, the company is trying to reinvent storage (left) and compute (far right). Image: Jack Clark

The equipment in the lab — a novel Open Vault storage array and various versions of the Open Compute server — is being developed by Facebook as part of its Open Compute Project, a cross-industry scheme by the company to bring an open-source approach to physical hardware. 

The Open Compute Project was launched by Facebook in April 2011 as a way of distributing its server designs. In an attempt to seek broader participation in the scheme, the company spun the project off into its own Foundation in October 2011.

Facebook remains the initiative's de facto leader: its vice president of hardware design and supply chain operations, Frank Frankovsky, is the chairman of its board of directors. That said, the rest of the board are from major enterprises such as Intel, Rackspace, Arista Networks and Goldman Sachs. If these companies are involved in this scheme, you can assume that the Open Compute approach is something that both IT buyers and IT sellers think is worth a bet.

Lifting the industry

"Our goal is to be non-proprietary," Matt Corddry, a senior manager of hardware engineering at Facebook, said during my recent visit to the lab. "We're not trying to maintain an advantage with this gear, we're trying to elevate the industry."

This approach contrasts with that of other large cloud operators. Google, Amazon and Microsoft are all notoriously secretive about their datacentre infrastructure, though Google occasionally releases research papers outlining some of its more advanced software systems.

The Open Compute scheme has received broad industry interest, with both AMD and Intel contributing motherboard designs and CAD documents. Facebook thinks that in time, its Open Compute designs could shake up the enterprise IT landscape. 

"What I see happening is a lot of these principles that we've shared will start to take root in enterprise systems as well," Michael said. "The server can be lightweight; it can be vanity-free."

In fact, the Open Compute Foundation says it believes upcoming Open Compute motherboards designed by Intel (codenamed 'Decathlete') and AMD ('Roadrunner') could, in time, become "a universal motherboard, in terms of functionality, supporting 70 to 80 percent of target enterprise infrastructure use cases".

As for the hardware itself, both the Open Compute server and the storage equipment are designed differently from the types of gear made by enterprise vendors such as HP, IBM and Dell.

Sled servers lead the way 

The second generation of the Open Compute server integrates an air duct into the chassis. Image: Jack Clark

The servers (pictured) are based on a sled chassis built to work with Facebook's Open Rack. This is a new approach to server rack design that seeks to take equipment typically found on servers, such as power systems and networking, and plug it into the rack itself.

The servers take power from a power distribution system that lives in a portion of the rack, rather than in the server, and the drives are situated at the front to make it simpler to swap them out if they fail.

The prototype servers (pictured) are version 2 of the Open Compute specification.

The major differences in the new compute server compared with its predecessor are a move to a single motherboard per chassis, larger fans (now 80mm, up from 60mm) that consume less power, and the incorporation of an air duct in the server sled's chassis. This means Facebook can save on the cost of building plastic air ducts then fitting them to its servers. 

In the future, Facebook hopes to entirely remove the drive from the web servers and boot off a low-power, more-reliable mSATA solid-state drive. A 60GB drive should be sufficient to host Facebook's OS and its logs.

mSATA drives are typically used in laptops, but Michael's team has built an adapter that lets Facebook use them in servers.

Corddry is keen on this, as it lets Facebook obtain cost savings from a "really high volume commodity part", he said, noting that "you don't need enterprise-grade equipment to boot a web server".

Open Vault storage push

The other major project the Facebook hardware labs team is working on is a way of redesigning storage arrays to suit large-scale datacentres.

The Open Vault storage systems can selectively cut power to rarely used storage, saving power. Image: Jack Clark

The Open Vault equipment, codenamed Knox (pictured), packs multiple hard drives onto a retractable sled. This can be pulled out and then, using a hinge, lowered to allow engineers to easily swap drives out in case of failures. 

Facebook has a constant backlog of equipment that needs maintenance, Michael said, with a rough annualised failure rate of about one percent. For this reason, making it easy to maintain kit and swap out failed drives has become a priority. 

Knox has a feature that lets it cut the power to individual drives when they are not being used, and differing numbers of drives can be attached to each motherboard according to the processing needs of the storage server.

This gives Facebook two useful features. To start with, it can give power to its 30-odd drives according to the frequency with which their data is accessed. In other words, regularly accessed information can be kept on drives that are always switched on, while rarely touched data can be put on drives that are by default powered-down and only switched on when an access request is made. That lets the company save on power.
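The placement idea described above can be illustrated with a rough sketch. This is not Facebook's software: the `Drive` class, the `choose_tier` function and the threshold of ten accesses a day are all hypothetical, chosen only to show how regularly accessed data might be steered to always-on drives while rarely touched data sits on drives that are woken on demand.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    """Hypothetical model of one drive in a storage tray."""
    name: str
    always_on: bool       # hot drives stay powered; cold drives may be cut off
    powered: bool = True


def choose_tier(accesses_per_day: float, hot_threshold: float = 10.0) -> str:
    """Pick a tier for a piece of data. The threshold is an assumed value."""
    return "hot" if accesses_per_day >= hot_threshold else "cold"


def read(drive: Drive) -> str:
    """Serve a read, powering the drive on first if it was switched off."""
    woke = not drive.powered
    drive.powered = True
    return "powered on, then read" if woke else "read"


def idle(drive: Drive) -> None:
    """After a quiet period, cut power to drives that are not always-on."""
    if not drive.always_on:
        drive.powered = False
```

In this sketch a cold drive pays a wake-up cost on its first read, which is the trade-off the article describes: slower access to rarely used data in exchange for lower power draw.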

Another benefit is that it gives the company a way to solve hardware problems. 

"Drives... actually fail the most in our datacentre," Michael said. "Part of our procedure is when a drive fails we try and power cycle it."

Yes — sometimes it really is a simple matter of turning it off and on again, according to Facebook's team. "A lot of drive manufacturers get returns with no trouble found," Corddry said. 
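As a loose illustration of that procedure, the sketch below tries a power cycle before flagging a drive for replacement. The function name, the callbacks and the single-retry default are assumptions made for the example, not details Facebook has published.

```python
from typing import Callable


def triage_failed_drive(
    power_cycle: Callable[[], None],
    is_healthy: Callable[[], bool],
    max_cycles: int = 1,  # assumed default; the article gives no retry count
) -> str:
    """Power cycle a failed drive and re-check it before giving up.

    Returns "recovered" if the drive passes a health check after a power
    cycle, or "replace" if it is still failing after max_cycles attempts.
    """
    for _ in range(max_cycles):
        power_cycle()
        if is_healthy():
            return "recovered"
    return "replace"
```

A drive that comes back after a power cycle never has to be pulled from the tray, which would help explain the "no trouble found" returns Corddry mentions.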

An additional benefit of Knox is that its design makes it relatively easy to manipulate the proportion of storage assigned to...


Jack Clark has spent the past three years writing about the technical and economic principles that are driving the shift to cloud computing. He's visited data centers on two continents, quizzed senior engineers from Google, Intel and Facebook on the technologies they work on and read more technical papers than you care to name on topics f...