Google operates technology that is expected to be reliable in the face of major traffic demands.
To scale its services, the company has developed many systems, such as MapReduce and Google File System, that have since been made open source by Yahoo and worked into the popular Hadoop data-analytics framework.
However, behind the scenes, the company is fighting a constant battle against the twin demons of cascading failovers and the increasingly challenging levels of complexity that massively scaled services bring.
Urs Hölzle was Google's first vice president of engineering. Before joining Google he worked on high-performance implementations of object-orientated languages, contributed to Darpa's national compiler infrastructure project, and developed compilers for Smalltalk and Java.
According to Hölzle, "at scale, everything breaks", and Google must walk a tightrope between increasing the scaling of its systems while avoiding cascading failovers, such as the outage that affected Gmail in March this year.
Q: Apart from focusing on physical infrastructure, such as datacentres, are there efficiencies that Google gains from running software at massive scale?
A: I think there absolutely is a very large benefit there, probably more so than you can get from the physical efficiency. It's because when you have an on-premise server it's almost impossible to size the server to the load, because most servers are actually too powerful and most companies [using them] are relatively small.
[But] if you have a large-scale email service where millions of accounts are in one place, it's much easier to size the pool of servers to that load. If you aggregate the load, it's intrinsically much easier to keep your servers well utilised.
What are Google's plans for the evolution of its internal software tools?
There's obviously an evolution. For example, most applications don't use [Google File System (GFS)] today. In fact, we're phasing out GFS in favour of the next-generation file system that is very similar, but it's not GFS anymore. It scales better and has better latency properties as well. I think three years from now we'll try to retire that because flash memory is coming and faster networks and faster CPUs are on the way and that will change how we want to do things.
One of the nice things is that if everyone today is using the Bigtable compressed database, suppose we have a better Bigtable down the line that does the right thing with flash — then it's relatively easy to migrate all these applications as long as the API stays stable.
How significant is it to have these back-end systems — such as MapReduce and the Google File System — spawn open-source applications such as Hadoop through publication and adaptation by other companies?
It's an unavoidable trend in the sense that [open source] started with the operating system, which was the lowest level that everyone needed. But the power of open source is that you can continue to build on the infrastructure that already exists [and you get] things like Apache for the web server. Now we're getting into a broader range of services that are available through the cloud.
For instance, cluster management itself or some open-source version will happen, because everyone needs it as their computation scales and their issue becomes not the management of a single machine, but the management of a whole bunch of them. Average IT shops will have hundreds of virtual machines (VMs) or hundreds of machines they need to manage, so a lot of their work is about cluster management and not about the management of individual VMs.
Often, if computation is cheap enough, then it doesn't pay to...
...do your own solution. Maybe you can do your own solution but you [might not be able to] justify the software engineering effort and then the ongoing maintenance, instead of staying within an open-source system.
In your role, what is the most captivating technical problem you deal with?
I think the big challenges haven't changed that much. I'd say that it's dealing with failure, because at scale everything breaks no matter what you do and you have to deal reasonably cleanly with that and try to hide it from the people actually using your system.
At scale everything breaks no matter what you do and you have to deal reasonably cleanly with that and try to hide it from the people actually using your system.
There are two big reasons why MapReduce-Hadoop is really popular. One is that it gets rid of your parallelisation problem because it parallelises automatically and does load-balancing automatically across machines. But the second one is if you have a large computation, it deals with failures. So if one of [your] machines dies in the middle of a 10-hour computation, then you're fine. It just happens.
I think the second one is dealing with stateful, mutable states. MapReduce is easy because it's a case of presenting it with a number of files and having it compute them and, if things go wrong, you can just do it again. But Gmail, IM and other stateful services have very different security [uptime and data-loss] implications.
We use tapes, still, in this age because they're actually a very cost-effective way as a last resort for Gmail. The reason why we put it in is not physical data loss, but once in a blue moon you will have a bug that destroys all copies of the online data and your only protection is to have something that is not connected to the same software system, so you can go and redo it.
The last challenge we're seeing is to use commodity hardware but actually make it work in the face of rapid innovation cycles. For example, here's a new HDD (hard-disk drive). There's a lot of pressure in the market to get it out because you want to be the first one with a 3TB drive and there's a lot of cost pressure to, but how do you actually make these drives reliable?
As a large-scale user, we see all the corner cases and in virtually every piece of hardware we use we find bugs, even if it's a shipping piece of hardware.
If you use the same operating system, like Linux, and run the same computation on 10,000 machines and every day 100 of them fail, you're going to say, wow this is wrong. But if you did it by yourself, it's a one-percent failure rate. So three times a year you'd have to change your server. You probably wouldn't take the effort to debug and you'd think it was a random fluke or you'd debug and it wouldn't be happening any more.
It seems you want all your services to speak to each other. But surely this introduces its own problems of complexity?
Automation is key, but it's also dangerous. You can shut down all machines automatically if you have a bug. It's one of the things that is very challenging to do because you want uniformity and automation, but at the same time you can't really automate everything without lots of safeguards or you get into cascading failures.
Keeping things simple and yet scalable is actually the biggest challenge.
Complexity is evil in the grand scheme of things because it makes it possible for these bugs to lurk that you see only once every two or three years, but when you see them it's a big story because it had a large, cascading effect.
Keeping things simple and yet scalable is actually the biggest challenge. It's really, really hard. Most things don't work that well at scale, so you need to introduce some complexity, but you have to keep it down.
Have you looked into some of the emerging hardware, such as PCIe-linked flash?
We're not a priori excluding anything and we're playing with things all the time. I would expect PCIe flash to become a commodity because it's a pretty good way of exposing flash to your operating system. But flash is still tricky because the durability is not very good.
I think these all have the promise of deeply affecting how applications are written because if you can afford to put most of your data in a storage medium like this rather than on a moving head, then a factor of a thousand makes a huge difference. But these things are not close enough yet to disk in terms of storage cost.