- Over four billion Web pages, each an average of 10KB, all fully indexed
- Up to 2,000 PCs in a cluster
- Over 30 clusters
- 104 interface languages including Klingon and Tagalog
- One petabyte of data in a cluster -- so much that hard disk error rates of 10^-15 begin to be a real issue
- Sustained transfer rates of 2Gbps in a cluster
- An expectation that two machines will fail every day in each of the larger clusters
- No complete system failure since February 2000
It is one of the largest computing projects on the planet, arguably employing more computers than any other single, fully managed system (we're not counting distributed computing projects here), some 200 computer science PhDs, and 600 other computer scientists.
And it is all hidden behind a deceptively simple white Web page that contains a single one-line text box and a button that says Google Search.
When Arthur C. Clarke said that any sufficiently advanced technology is indistinguishable from magic, he was alluding to the trick of hiding the complexity of the job from the audience, or the user. Nobody hides the complexity of the job better than Google does; so long as we have a connection to the Internet, the Google search page is there day and night, every day of the year, and it is not just there, but it returns results. Google recognises that the returns are not always perfect, and there are still issues there -- more on those later -- but when you understand the complexity of the system behind that Web page you may be able to forgive the imperfections. You may even agree that what Google achieves is nothing short of sorcery.
On Thursday evening, Google's vice-president of engineering, Urs Hölzle, who has been with the company since 1999 and who is now a Google fellow, gave would-be Google employees an insight into just what it takes to run an operation on such a scale, with such reliability. ZDNet UK snuck in the back to glean some of the secrets of Google's magic.
Google's vision is broader than most people imagine, said Hölzle: "Most people say Google is a search engine but our mission is to organise information to make it accessible."
Behind that, he said, comes a vast scale of computing power based on cheap, no-name hardware that is prone to failure. There are hardware malfunctions not just once, but time and time again, many times a day.
Yes, that's right, Google is built on imperfect hardware. The magic is writing software that accepts that hardware will fail, and expeditiously deals with that reality, says Hölzle.
Google indexes over four billion Web pages, using an average of 10KB per page, which comes to about 40TB. Google is asked to search this data over 1,000 times every second of every day, and typically comes back with sub-second response times. If anything goes wrong, said Hölzle, "you can't just switch the system off and switch it back on again."
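The figures quoted above are easy to sanity-check with a few lines of arithmetic. The sketch below is illustrative only: it assumes decimal units (1KB = 1,000 bytes, 1PB = 10^15 bytes) and reads the quoted hard disk error rate of 10^-15 as errors per bit read, which is an interpretation rather than something Hölzle spelled out. All variable names are made up for the example.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# Assumptions (not from the talk): decimal units, and the disk error
# rate interpreted as unrecoverable errors per bit read.

PAGES = 4_000_000_000        # "over four billion Web pages"
BYTES_PER_PAGE = 10 * 1000   # "an average of 10KB"
QUERIES_PER_SECOND = 1_000   # "over 1,000 times every second"
CLUSTER_BYTES = 10**15       # "one petabyte of data in a cluster"
BIT_ERROR_RATE = 1e-15       # assumed to mean errors per bit read

SECONDS_PER_DAY = 24 * 60 * 60

# Total size of the indexed pages, in terabytes.
index_tb = PAGES * BYTES_PER_PAGE / 10**12

# Lower bound on the number of searches served per day.
queries_per_day = QUERIES_PER_SECOND * SECONDS_PER_DAY

# Expected number of bit errors when reading a whole cluster once.
expected_bit_errors = CLUSTER_BYTES * 8 * BIT_ERROR_RATE

print(f"Index size: {index_tb:.0f} TB")
print(f"Queries per day: {queries_per_day:,}")
print(f"Expected bit errors per full read: {expected_bit_errors:.1f}")
```

Under those assumptions, reading a full petabyte once would be expected to hit a handful of bit errors, which is why an error rate that is negligible on a single PC becomes a real issue at cluster scale.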
The job is not helped by the nature of the Web. "In academia," said Hölzle, "the information retrieval field has been around for years, but that is for books in libraries. On the Web, content is not nicely written -- there are many different grades of quality."
Some pages, he noted, may not even have text. "You may think we don't need to know about those but that's not true -- it may be the home page of a very large company where the Webmaster decided to have everything graphical. The company name may not even appear on the page."
Google deals with such pages by regarding the Web not as a collection of text documents, but as a collection of linked text documents, with each link containing valuable information.
"Take a link pointing to the Stanford university home page," said Hölzle. "This tells us several things: First, that someone must think pointing to Stanford is important. The text in the link also gives us some idea of what is on the page being pointed to. And if we know something about the page that contains the link we can tell something about the quality of the page being linked to."
This knowledge is encapsulated in Google's famous PageRank algorithm, which looks not just at the number of links to a page but at the quality or weight of those links, to help determine which page is most likely to be of use, and so which is presented at the top of the list when the search results are returned to the user. Hölzle believes the PageRank algorithm is 'relatively' spam resistant, and those interested in exactly how it works can find more information here.
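To make the idea of weighting links concrete, here is a minimal sketch of the textbook PageRank iteration. It is not Google's production code: the damping factor of 0.85, the fixed iteration count, the handling of pages with no outgoing links, and the tiny example graph (with page names such as "hub" and "other") are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "stanford" and "other" each have exactly one
# incoming link, but the page linking to "stanford" is itself well linked.
example_links = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "hub": ["stanford"],
    "d": ["other"],
}

ranks = pagerank(example_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:10s} {score:.3f}")
```

In this toy graph both "stanford" and "other" are pointed to by a single page, yet "stanford" ends up with the higher score because its one link comes from a page that is itself heavily linked, which is the quality-over-quantity effect described above.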