The numbers alone are enough to make your eyes water.
- Over four billion Web pages, each an average of 10KB, all fully indexed
- Up to 2,000 PCs in a cluster
- Over 30 clusters
- 104 interface languages including Klingon and Tagalog
- One petabyte of data in a cluster -- so much that hard disk error rates of 10-15 begin to be a real issue
- Sustained transfer rates of 2Gbps in a cluster
- An expectation that two machines will fail every day in each of the larger clusters
- No complete system failure since February 2000
It is one of the largest computing projects on the planet, arguably employing more computers than any other single, fully managed system (we're not counting distributed computing projects here), some 200 computer science PhDs, and 600 other computer scientists.
And it is all hidden behind a deceptively simple, white, Web page that contains a single one-line text box and a button that says Google Search.
When Arthur C. Clarke said that any sufficiently advanced technology is indistinguishable from magic, he was alluding to the trick of hiding the complexity of the job from the audience, or the user. Nobody hides the complexity of the job better than Google does; so long as we have a connection to the Internet, the Google search page is there day and night, every day of the year, and it is not just there, but it returns results. Google recognises that the returns are not always perfect, and there are still issues there -- more on those later -- but when you understand the complexity of the system behind that Web page you may be able to forgive the imperfections. You may even agree that what Google achieves is nothing short of sorcery.
On Thursday evening, Google's vice-president of engineering, Urs Hölzle, who has been with the company since 1999 and who is now a Google fellow, gave an insight to would-be Google employees into just what it takes to run an operation on such a scale, with such reliability. ZDNet UK snuck in the back to glean some of the secrets of Google's magic.
Google's vision is broader than most people imagine, said Hölzle: "Most people say Google is a search engine but our mission is to organise information to make it accessible."
Behind that, he said, comes a vast scale of computing power based on cheap, no-name hardware that is prone to failure. There are hardware malfunctions not just once, but time and time again, many times a day.
Yes, that's right, Google is built on imperfect hardware. The magic is writing software that accepts that hardware will fail, and expeditiously deals with that reality, says Hölzle.
Google indexes over four billion Web pages, using an average of 10KB per page, which comes to about 40TB. Google is asked to search this data over 1,000 times every second of every day, and typically comes back with sub-second response rates. If anything goes wrong, said Hölzle, "you can't just switch the system off and switch it back on again."
The job is not helped by the nature of the Web. "In academia," said Hölzle, "the information retrieval field has been around for years, but that is for books in libraries. On the Web, content is not nicely written -- there are many different grades of quality."
Some, he noted, may not even have text. "You may think we don't need to know about those but that’s not true -- it may be the home page of a very large company where the Webmaster decided to have everything graphical. The company name may not even appear on the page."
Google deals with such pages by regarding the Web not as a collection of text documents, but a collection of linked text documents, with each link containing valuable information.
"Take a link pointing to the Stanford university home page," said Hölzle. "This tells us several things: First, that someone must think pointing to Stanford is important. The text in the link also gives us some idea of what is on the page being pointed to. And if we know something about the page that contains the link we can tell something about the quality of the page being linked to."
This knowledge is encapsulated in Google's famous PageRank algorithm, which looks not just at the number of links to a page but at the quality or weight of those links, to help determine which page is most likely to be of use, and so which is presented at the top of the list when the search results are returned to the user. Hölzle believes the PageRank algorithm is 'relatively' spam resistant, and those interested in exactly how it works can find more information here.
Obviously it would be impractical to run the algorithm once every page for every query, so Google splits the problem down.
When a query comes in to the system it is sent off to index servers, which contain an index of the Web. This index is a mapping of each word to each page that contains that word. For instance, the word 'Imperial' will point to a list of documents containing that word, and similarly for 'College'. For a search on 'Imperial College' Google does a Boolean 'AND' operation on the two words to get a list of what Hölzle calls 'word pages'.
"We also consider additional data, such as where in the page does the word occur: in the title, the footnote, is it in bold or not, and so on.
Each index server indexes only part of the Web, as the whole Web will not fit on a single machine - certainly not the type of machines that Google uses. Google's index of the Web is distributed across many machines, and the query gets sent to many of them - Google calls each on a shard (of the Web). Each one works on its part of the problem.
Google computes the top 1000 or so results, and those come back as document IDs rather than text. The next step is to use document servers, which contain a copy of the Web as crawled by Google's spiders. Again the Web is essentially chopped up so that each machine contains one part of the Web. When a match is found, it is sent to the ad server which matches the ads and produces the familiar results page.
Google's business model works because all this is done on cheap hardware, which allows it to run the service free-of-charge to users, and charge only for advertising.
"Even though it is a big problem", said Hölzle, "it is tractable, and not just technically but economically too. You can use very cheap hardware, but to do this you have to have the right software."
Google runs its systems on cheap, no-name IU and 2U servers -- so cheap that Google refers to them as PCs. After all each one has a standard x86 PC processor, standard IDE hard disk, and standard PC reliability - which means it is expected to fail once in three years.
On a PC at home, that is acceptable for many people (if only because they're used to it), but on the scale that Google works at it becomes a real issue; in a cluster of 1,000 PCs you would expect, on average, one to fail every day. "On our scale you cannot deal with this failure by hand," said Hölzle. "We wrote our software to assume that the components will fail and we can just work around it. This software is what makes it work.
One key idea is replication. "This server that contains this shard of the Web, let's have two, or 10," said Hölzle. "This sound sounds expensive, but if you have a high-volume service you need that replication anyway. So you have replication and redundancy for free. If one fails you have 10 percent reduction in service so no failures so long as the load balancer works. So failure becomes and a manageable event."
In reality, he said, Google probably has "50 copies of every server". Google replicates servers, sets of servers and entire data centres, added Hölzle, and has not had a complete system failure since February 2000. Back then it had a single data centre, and the main switch failed, shutting the search engine down for an hour. Today the company mirrors everything across multiple independent data centres, and the fault tolerance works across sites, "so if we lose a data centre we can continue elsewhere -- and it happens more often than you would think. Stuff happens and you have to deal with it."
A new data centre can be up and running in under three days. "Our data centre now is like an iMac," said Schulz." You have two cables, power and data. All you need is a truck to bring the servers in and the whole burning in, operating system install and configuration is automated."
Working around failure of cheap hardware, said Hölzle, is fairly simple. If a connection breaks it means that machine has crashed so no more queries are sent to it. If there is no response to a query then again that signals a problem, and it can cut it out of the loop.
That is redundancy taken care of, but what about scaling? The Web grows every year, as do the number of people using it, and that means more strain on Google's servers.
Google has two crucial factors in its favour. First, the whole problem is what Hölzle refers to as embarrassingly parallel, which means that if you double the amount of hardware, you can double performance (or capacity if you prefer -- the important point is that there are no diminishing returns as there would be with less parallel problems).
The second factor in Google's favour is the falling cost of hardware. If the index size doubles, then the embarrassingly parallel nature of the problem means that Google could double the number of machines and get the same response time so it can grow linearly with traffic. "In reality (from a business point of view) we would like to grow less than linear to keep costs down," said Hölzle, "but luckily the hardware keeps getting cheaper."
So every year as the Web gets bigger and requires more hardware to index, search and return Web pages, hardware gets cheaper so it "more or less evens out" to use Hölzle's words.
As the scale of the operation increases, it introduces some particular problems that would not be an issue on smaller systems. For instance, Google uses IDE drives for all its storage. They are fast and cheap, but not highly reliable. To help deal with this, Google developed its own file system -- called the Google File System, or GFS -- which assumes an individual unit of storage can go away at any time either because of a crash, a lost disk or just because someone stepped on a cable.
There are no disk arrays within individual PCs; instead Google stores every bit of data in triplicate on three machines on three racks on three data switches to make sure there is no single point of failure between you and the data. "We use this for hundreds of terabytes of data," said Hölzle.
Don't expect to see GFS on a desktop near you any time soon - it is not a general-purpose file system. For instance, a GFS block size is 64MB, compared to the more usual 2KB on a desktop file system. Hölzle said Google has 30 plus clusters running GFS, some as large as 2,000 machines with petabytes of storage. These large clusters can sustain read/write speeds of 2Gbps - a feat made possible because each PC manages 2Mbps.
Once, said Hölzle, "someone disconnected an 80-machine rack from a GFS cluster, and the computation slowed down as the system began to re-replicate and we lost some bandwidth, but it continued to work. This is really important if you have 2,000 machines in a cluster." If you have 2000 machines then you can expect to see two failures a day.
The problem |
The process |
The hardware Other technical challenges
The scalability |
Running thousands of cheap servers with relatively high failure rates is not an easy job. Standard tools don't work at this scale, so Google has had to develop them in-house. Some of the other challenges the company continues to face include:
Debugging: "You see things on the real site you never saw in testing because some special set of circumstances that create a bug," said Hölzle. "This can create non-trivial but fun problems to work on."
Data errors: A regular IDE hard disk will have an error rate in the order of 10-15 -- that is one millionth of one billionth of the data written to it may get corrupted and the hard-disk's own error checking will not pick it up. "But when you have a petabyte of data you need to start worrying about these failures," said Hölzle. "You must expect that you will have undetected bit errors on your disk several times a month, even with hardware checking built-in, so GFS does have an extra level of checksumming. Again this is something we didn’t expect, but things happen."
Spelling: Google wrote its own spell checker, and maintains that nobody know as many spelling errors as it does. The amount of computing power available at the company means it can afford to begin teaching the system which words are related - for instance "Imperial", "College" and "London". It's a job that many CPU years, and which would not have been possible without these thousands of machines. "When you have tons of data and tons of computation you can make things work that don’t work on smaller systems," said Hölzle. One goal of the company now is to develop a better conceptual understanding of text, to get from the text string to a concept.
Power density: "There is an interesting problem when you use PCs," said Hölzle. "If you go to a commercial data centre and look at what they can support, you'll see a typical design allowing for 50W to 100W per square foot. At 200W per square foot you notice the sales person still wants to sell it but their international tech guy starts sweating. At 300W per square foot they cry out in pain."
Eighty mid-range PCs in a rack, of which you will find many dozens in a Google data centre, produce over 500W per square foot. "So we're not going to blade technology," said Hölzle. "We're already too dense. Finally Intel has realised this is a problem and is now focusing more on power efficiency, but it took some time to get the message across."
Quality of search results: One big area of complaints for Google is connected to the growing prominence of commercial search results -- in particular price comparison engines and e-commerce sites. Hölzle is quick to defend Google's performance "on every metric", but admits there is a problem with the Web getting, as he puts it, "more commercial". Even three years ago, he said, the Web had much more of a grass roots feeling to it. "We have thought of having a button saying 'give me less commercial results'," but the company has shied away from implementing this yet.