Google: 'At scale, everything breaks'

Summary: Distinguished Google Fellow Urs Hölzle discusses the challenges of scaling Google's infrastructure, from coping with cascading failures to the rise of flash storage in the modern datacentre


Google operates technology that is expected to be reliable in the face of major traffic demands.

To scale its services, the company has developed many systems, such as MapReduce and Google File System, whose published designs inspired the popular open-source Hadoop data-analytics framework.

However, behind the scenes, the company is fighting a constant battle against the twin demons of cascading failures and the increasingly challenging levels of complexity that massively scaled services bring.

Urs Hölzle was Google's first vice president of engineering. Before joining Google he worked on high-performance implementations of object-oriented languages, contributed to DARPA's National Compiler Infrastructure project, and developed compilers for Smalltalk and Java.

According to Hölzle, "at scale, everything breaks", and Google must walk a tightrope between increasing the scaling of its systems while avoiding cascading failures, such as the outage that affected Gmail in March this year.

Q: Apart from focusing on physical infrastructure, such as datacentres, are there efficiencies that Google gains from running software at massive scale?
A: I think there absolutely is a very large benefit there, probably more so than you can get from the physical efficiency. It's because when you have an on-premises server it's almost impossible to size the server to the load, because most servers are actually too powerful and most companies [using them] are relatively small.

[But] if you have a large-scale email service where millions of accounts are in one place, it's much easier to size the pool of servers to that load. If you aggregate the load, it's intrinsically much easier to keep your servers well utilised.
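The sizing argument can be made concrete with a small simulation (the numbers are hypothetical, purely for illustration): if each small tenant must provision for its own peak, the total capacity bought is the sum of individual peaks, whereas a shared pool only needs to cover the peak of the aggregate load, which statistical multiplexing keeps much lower.

```python
import random

random.seed(42)

# Hypothetical illustration: 1,000 small tenants, each with a bursty,
# independent hourly load between 1 and 10 request-units.
tenants = 1000
hours = 24
loads = [[random.uniform(1, 10) for _ in range(hours)] for _ in range(tenants)]

# On-premises sizing: every tenant provisions for its own peak hour.
per_tenant_capacity = sum(max(tenant) for tenant in loads)

# Aggregated sizing: one shared pool provisions for the peak of the total.
pooled_capacity = max(sum(tenant[h] for tenant in loads) for h in range(hours))

print(f"sum of per-tenant peaks: {per_tenant_capacity:.0f}")
print(f"peak of pooled load:     {pooled_capacity:.0f}")
print(f"capacity saved by pooling: {1 - pooled_capacity / per_tenant_capacity:.0%}")
```

Because the tenants' bursts are uncorrelated, the pooled peak sits near the sum of the averages rather than the sum of the peaks, which is why the aggregated service stays well utilised.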

Q: What are Google's plans for the evolution of its internal software tools?
A: There's obviously an evolution. For example, most applications don't use [Google File System (GFS)] today. In fact, we're phasing out GFS in favour of the next-generation file system that is very similar, but it's not GFS anymore. It scales better and has better latency properties as well. I think three years from now we'll try to retire that, because flash memory is coming, faster networks and faster CPUs are on the way, and that will change how we want to do things.

One of the nice things is that if everyone today is using the Bigtable compressed database, suppose we have a better Bigtable down the line that does the right thing with flash — then it's relatively easy to migrate all these applications as long as the API stays stable.
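The point about a stable API can be sketched in a few lines (hypothetical names throughout; this is not Bigtable's actual interface): applications code against an abstract store, so the backend can be swapped for a flash-optimised successor without touching application code.

```python
from abc import ABC, abstractmethod
from typing import Optional


class KVStore(ABC):
    """Hypothetical storage API; applications depend only on this interface."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...


class DiskBackedStore(KVStore):
    """Stand-in for the current disk-based implementation."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class FlashBackedStore(KVStore):
    """Stand-in for a flash-optimised successor exposing the same API."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


def application(store: KVStore) -> Optional[bytes]:
    # Application code is written against KVStore only, so migrating
    # to a new backend requires no changes here.
    store.put("user:1", b"alice")
    return store.get("user:1")
```

As long as `KVStore` stays stable, `application` runs identically against either backend, which is the migration property Hölzle describes.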

Q: How significant is it to have these back-end systems — such as MapReduce and the Google File System — spawn open-source applications such as Hadoop through publication and adaptation by other companies?
A: It's an unavoidable trend in the sense that [open source] started with the operating system, which was the lowest level that everyone needed. But the power of open source is that you can continue to build on the infrastructure that already exists [and you get] things like Apache for the web server. Now we're getting into a broader range of services that are available through the cloud.

For instance, cluster management itself or some open-source version will happen, because everyone needs it as their computation scales and their issue becomes not the management of a single machine, but the management of a whole bunch of them. Average IT shops will have hundreds of virtual machines (VMs) or hundreds of machines they need to manage, so a lot of their work is about cluster management and not about the management of individual VMs.

Often, if computation is cheap enough, then it doesn't pay to...

Jack Clark



  • Very good article, and insightful about the challenges Google faces. It also explains why the banks in Australia are experiencing a run of service failures in their systems, which has never happened before - because the complexity has got beyond them. It's not going to go away, and the effort to fix it will be bigger than the effort it took to create the instability they currently have. With a skill base a fraction of Google's, the Aussie banks are facing dire consequences from out-of-control complexity.

    Walter @adamson
  • A Plan Ahead for the Cloud

    IT pros should recognize in your analysis that you can manage, minimize, or mitigate risk, but not eliminate it. While “technical fixes” are wonderful when they work, they have limits. Part of the good engineering your organization needs is to recognize what lies beyond the bounds of engineering. If nothing else, rely on honesty, or its 21st century transform, “transparency”: there might come a time when you simply need to tell your customers that your plans didn’t allow for a backhoe accident, a truckers’ strike, an atypical hurricane, and an influenza epidemic all happening in the same week.

    Managing computing systems never gives much opportunity for rest, simply because change is so rapid; even when you figure out the right answer today, technical advances can change everything by tomorrow. What you can do, however, especially when deciding how much of your own business to push into The Cloud, is to think clearly about your unique situation and requirements, design solutions that are right for you, understand clearly how complex systems fail, and have good recovery plans in place. Every one of these investments is “multi-purpose”: they pay off whatever you decide about, and experience in, The Cloud.