'Let's try and not have a human do it': How one Facebook techie can run 20,000 servers

Facebook on how it has managed to ease the pain of running one of the world's largest fleets of servers.
Written by Nick Heath, Contributor

How many people does it take to run 20,000 servers? In the case of Facebook, just one.

Facebook's low overhead is born of necessity. When you run a computing infrastructure supporting the Likes of more than one billion users worldwide, the less human input needed to keep it ticking over, the better.

The philosophy of the social media giant is that if a job doesn't need to be carried out by a person, "let's try and not have a human do it", said Tom Furlong, Facebook's director of site operations.

Facebook cuts manual maintenance of its IT infrastructure wherever possible. Servers and drives can be replaced without tools thanks to their Open Compute Project (OCP) designs, a system called Cyborg automatically fixes swathes of misbehaving servers, and tools like Chef help manage thousands of servers at a time.

One of the latest of these manual labour-saving tools is Cluster Planner. Facebook regularly deploys servers in their thousands to meet the constantly changing demands of its business. Cluster Planner helps the company find the best home for these clusters within its global datacentre estate.

"Cluster Planner helps us calculate where we're going to put machines, so we can best utilise the infrastructure that we have," said Furlong.

"It basically replaces an incredibly manual process," he said, adding it had reduced cluster deployment time from days to hours.

Cluster Planner takes a snapshot of what sits where in Facebook's estate. It aggregates data about Facebook's computer systems – such as server components, average CPU utilisation and number of IOPS – and details about the datacentres, including electrical capacity and available floor space.

"It's normally a very cumbersome process for engineers to do the math calculations. With Cluster Planner you can do that very quickly, so what took us days before now takes us hours."
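The article gives no detail on how Cluster Planner does its sums, but the kind of calculation Furlong describes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Site` and `ClusterRequest` types, the field names, the made-up figures and the "tightest fit first" ranking are hypothetical, not Facebook's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    # Hypothetical snapshot of one datacentre; field names are illustrative.
    name: str
    spare_power_kw: float   # unused electrical capacity
    free_rack_slots: int    # unused floor positions


@dataclass(frozen=True)
class ClusterRequest:
    racks: int
    kw_per_rack: float


def candidate_sites(sites, request):
    """Return names of sites that can host the cluster, tightest fit first."""
    needed_kw = request.racks * request.kw_per_rack
    fits = [
        s for s in sites
        if s.spare_power_kw >= needed_kw and s.free_rack_slots >= request.racks
    ]
    # Prefer the site left with the least spare power (an assumed heuristic).
    fits.sort(key=lambda s: s.spare_power_kw - needed_kw)
    return [s.name for s in fits]


# Made-up figures: a 50-rack cluster drawing 12 kW per rack needs 600 kW.
sites = [
    Site("site-a", spare_power_kw=900.0, free_rack_slots=40),   # too little floor space
    Site("site-b", spare_power_kw=400.0, free_rack_slots=120),  # too little power
    Site("site-c", spare_power_kw=650.0, free_rack_slots=60),   # fits on both counts
]
print(candidate_sites(sites, ClusterRequest(racks=50, kw_per_rack=12.0)))
```

A site has to pass both the power check and the floor-space check, which is why only the third example site qualifies.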

A core challenge for Facebook when cataloguing infrastructure spread across its global datacentres is ensuring consistency in how the data is collected and labelled, so Facebook can properly aggregate data. For example, data derived from generators at its datacentre in Prineville, Oregon must be labelled in the same way as that collected from generators at its Forest City installation, more than 2,500 miles away.
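Why labelling matters for aggregation can be shown with a small sketch. The label names, the mapping table and the readings below are all invented for illustration; the article does not say how Facebook names or stores its metrics.

```python
from collections import defaultdict

# Hypothetical mapping from site-specific labels to one canonical name.
CANONICAL_LABELS = {
    "gen_output_kw": "generator_output_kw",
    "GeneratorKW": "generator_output_kw",
    "generator_output_kw": "generator_output_kw",
}


def normalise(label):
    """Map a raw label to its canonical form, or fail loudly if it is unknown."""
    try:
        return CANONICAL_LABELS[label]
    except KeyError:
        raise ValueError(f"unmapped label: {label!r}") from None


def aggregate(readings):
    """Sum (site, label, value) readings per canonical label."""
    totals = defaultdict(float)
    for _site, label, value in readings:
        totals[normalise(label)] += value
    return dict(totals)


# Made-up readings: two sites reporting the same quantity under different labels.
readings = [
    ("prineville", "gen_output_kw", 1200.0),
    ("forest_city", "GeneratorKW", 800.0),
]
print(aggregate(readings))
```

Without the mapping step the two readings would land in separate buckets and the fleet-wide total would be wrong. Rejecting unknown labels outright, rather than passing them through, is a design choice made for this sketch.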

Facebook's Prineville datacentre. Image: Facebook

Cluster Planner is in beta: its tools are still being worked on, and Facebook is in the process of hooking it into all of its datacentres.

Keeping track of Swiss cheese clusters

The make-up of Facebook's datacentres is moulded by the ebb and flow of business demands, with its computing estate changing shape to support alterations to systems and the rollout of new software. Furlong believes Facebook's IT operations could become more efficient by using systems like Cluster Planner to capture information about the composition of server clusters as they change during their lifetime.

"You don't put a server in production and then have [it handling] exactly the same workload during its three year lifespan. It's a moving landscape," said Furlong.

"Over the lifespan of the cluster you start to see things disappearing from it. You get people saying 'Oh there are underutilised resources in that cluster, let's put them in that new cluster that we're bringing up'.

"After a while there's like this 'Swiss cheese' [cluster], which I know by looking at it doesn't have enough machines in it to be used efficiently.

"If we had something that allowed us to better calculate what that 'Swiss cheese' is, we could go back to capacity planning and say 'Hey send us some more stuff' or 'How about we compact this and use some free rows for other things'. Those are the types of things I want us to be able to do over the lifetime of the equipment, so we don't have this static view of the world."
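One simple way to put a number on a 'Swiss cheese' cluster is to compare the machines still in service with the positions the cluster was built to hold. The sketch below assumes exactly that measure; the 60 percent threshold, the cluster names and the counts are hypothetical and not taken from Facebook.

```python
def sparse_clusters(clusters, min_occupancy=0.6):
    """Return (name, occupancy) for clusters below the threshold, emptiest first.

    `clusters` maps a cluster name to (machines_in_service, designed_capacity).
    """
    flagged = []
    for name, (in_service, capacity) in clusters.items():
        if capacity <= 0:
            continue  # skip entries with no meaningful capacity figure
        occupancy = in_service / capacity
        if occupancy < min_occupancy:
            flagged.append((name, round(occupancy, 2)))
    return sorted(flagged, key=lambda item: item[1])


# Made-up inventory snapshot.
clusters = {
    "cluster-1": (950, 1000),
    "cluster-2": (420, 1000),
    "cluster-3": (580, 1000),
}
print(sparse_clusters(clusters))
```

A list like this would give capacity planners a starting point for the two options Furlong mentions: refilling a sparse cluster or compacting it to free up rows.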

Automating away the pain

Maintaining one of the world's largest fleets of servers could be a massive hassle.

Facebook eases that pain in several ways, such as by using OCP-designed servers, storage and datacentre equipment that have been stripped down to the core components needed to carry out specific computing workloads, an approach that the OCP refers to as 'vanity-free design'.

These designs not only remove unnecessary screws from servers and drives so they can be installed without tools, but also reduce the number of components that can go wrong. The company estimates that using OCP hardware has cut the time its staff spend on maintenance jobs by more than 50 percent.

Also cutting the number of manual repairs are automated software tools like the aforementioned Cyborg, which monitor Facebook's infrastructure and try to correct problems without manual intervention.

"We've said that one datacentre technician can basically handle about 20,000 machines," Furlong said.

"One of the reasons why we can operate as efficiently as we do in terms of server repairs is we have huge numbers of automated systems that collect and analyse data.

"Cyborg goes out and tries to do its own server repairs, like a soft reboot or something that is an easy way to fix a system that's hung up. Cyborg keeps literally thousands of potential tickets from ever hitting a human hand.

"The more stuff we can have a system do, the better. I don't want to employ armies of people to do the things I can automate out. I want the people who are there to do high value work, the more complex repairs, the installs and de-installs, and all the kind of things that you need people to do."
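The pattern Furlong describes for Cyborg, trying a cheap automated fix first and only involving a person if it fails, can be sketched in a few lines. This is not Cyborg's code or interface: the function names, the toy host states and the ticket text are all assumptions for illustration.

```python
from typing import Callable, Iterable, Optional


def remediate(
    host: str,
    is_healthy: Callable[[str], bool],
    fixes: Iterable[Callable[[str], None]],
) -> Optional[str]:
    """Try each automated fix in turn; return a ticket message only if all fail."""
    if is_healthy(host):
        return None
    for fix in fixes:
        fix(host)
        if is_healthy(host):
            return None  # fixed without human involvement
    return f"ticket: {host} needs a technician"


# Toy stand-ins: one host that is merely hung, one with a fault a reboot cannot fix.
state = {"web-001": "hung", "web-002": "bad-disk"}


def is_healthy(host):
    return state[host] == "ok"


def soft_reboot(host):
    # In this toy model a soft reboot only clears the "hung" state.
    if state[host] == "hung":
        state[host] = "ok"


print(remediate("web-001", is_healthy, [soft_reboot]))
print(remediate("web-002", is_healthy, [soft_reboot]))
```

Only the second host produces a ticket, which is the point of the approach: the technician sees the repairs that really need hands, not the ones a reboot would have cleared.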

Furlong is also examining ways to cut the manual work needed to add new hardware to Facebook's inventory system, and is considering automated alternatives to its current system of adding items by manually scanning barcodes.

"Say we have a setup where you just snap the new thing in and the system recognises it then updates all of your inventory systems," he said.

While the scale of Facebook's IT operations can be difficult to manage, its size can be an asset when it comes to weeding out problems in the infrastructure. Furlong recalls the company spotting a recurring fault with a small number of motherboards in its OCP servers when they were installed in its Prineville datacentre.

"What we realised was we actually had the same failures in all of our legacy equipment, we just couldn't see it before because the sample sizes were small," he said.

"That's the beauty of having a large sample size to look at, you can knock things out that keep you from having to do a whole bunch of work [in the future]."

If Facebook is to continue to grow its 1.1 billion-strong user base, it will need to keep working at sweating its datacentres, but Furlong believes there are plenty more efficiencies to wring out.

"To me that is the Holy Grail of this, to look at that server to datacentre ratio and be able to say 'I can get more servers into my datacentre'," he said.

"I think we do a good job of that today but I absolutely believe there is room for improvement."
