Two Google Fellows just published a paper in the latest issue of Communications of the ACM about MapReduce, the parallel programming model used to process more than 20 petabytes of data every day on commodity-based clusters. The latest statistics included in the paper are from September 2007. They show that MapReduce-based programs accounted for 2.2 million jobs using more than 11,000 machine years. In other words, this programming model alone uses the power of more than 130,000 commodity PCs. It's also interesting to note that this programming model is currently being used in more than 10,000 programs for applications such as large-scale graph processing, text processing or data mining. But read more...
You can see above a figure showing the overall flow of a MapReduce operation in the Google implementation when a user program calls the MapReduce function. "One of the copies of the program -- the master -- is special. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task. [...] When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code." (Credit: Google)
The figure above and its caption were extracted from the paper published in Communications of the ACM, "MapReduce: simplified data processing on large clusters" (Volume 51, Issue 1, Pages 107-113, January 2008). This paper has been written by two Google Fellows, Jeffrey Dean and Sanjay Ghemawat. Here is a link to this article (PDF format, 7 pages, 417 KB).
But what exactly is MapReduce? It "is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model."
And here is the text of the abstract of the 2008 paper. "MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day."
Even the authors are surprised by the success of this programming model inside Google. The first MapReduce library was written in February 2003. In the last 5 years, it has been used across a wide range of domains within Google, including:
- large-scale machine learning problems
- clustering problems for the Google News and Froogle products
- extracting data to produce reports of popular queries (e.g. Google Zeitgeist and Google Trends)
- extracting properties of Web pages for new experiments and products
- processing of satellite imagery data
- language model processing for statistical machine translation
- large-scale graph computations
As you can read the full version of the paper, I just want to mention the first part of its conclusions. "The MapReduce programming model has been successfully used at Google for many different purposes. We attribute this success to several reasons. First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. Second, a large variety of problems are easily expressible as MapReduce computations. For example, MapReduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems. Third, we have developed an implementation of MapReduce that scales to large clusters of machines comprising thousands of machines. The implementation makes efficient use of these machine resources and therefore is suitable for use on many of the large computational problems encountered at Google."
For more information, you can read several other documents available from Google.
- a previous paper from 2004 carrying the exact same name than the 2008 one, "MapReduce: Simplified Data Processing on Large Clusters" (links to the abstract and to the full article (PDF format, 11 pages, 186 KB))
- Introduction to Parallel Programming and MapReduce
- MapReduce in a Week
Sources: Communications of the ACM, January 2008; and various websites
You'll find related stories by following the links below.