In my last post, I explained MapReduce in terms of a hypothetical exercise: counting all the smartphones in the Empire State Building. My idea was to have the fire wardens count the smartphones, by platform, in each suite on their floor. For each suite, they'd write the smartphone platform name and number of handsets on a separate sheet of paper. They'd put all the sheets in an envelope, and drop the envelope in the mail chute. Down in the lobby, I'd collect all the envelopes from the mailboxes, open them, sort the sheets by platform name into separate piles, then stuff the piles into new envelopes and send a couple each to the fire wardens on the 10th, 20th and 30th floors. The fire wardens on those three floors would enter final tallies, for each platform they had sheets for, onto a new sheet of paper. Back in my office, I'd enter the three final tally sheets' data into a spreadsheet, and I'd have some valuable information, given that the Empire State Building has enough people to warrant its own ZIP code.
In this post, I want to correlate some of the actors and objects in the skyscraper scenario to MapReduce vocabulary. If you can follow the last post and this one, you'll understand MapReduce pretty well, all without getting bogged down in lines of code. And if you can do that, Big Data will make a lot more sense.
In our skyscraper analogy, the work the fire wardens did in getting the per-suite, per-platform handset count would be the Map step in our job, and the work the fire wardens on the 10th, 20th and 30th floors did, in calculating their final platform tallies, would be the Reduce step. A Map step and a Reduce step constitute a MapReduce job. Got that?
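The two steps can be sketched in a few lines of Python. This is a simplification, not real Hadoop code (Hadoop jobs implement these as methods on its Mapper and Reducer classes, typically in Java), and the function names here are made up for illustration:

```python
def map_suite(suite_number, handsets):
    """Map step: a fire warden tallies one suite, emitting a
    (platform, count) pair for each platform found in it."""
    counts = {}
    for platform in handsets:
        counts[platform] = counts.get(platform, 0) + 1
    for platform, count in counts.items():
        yield (platform, count)

def reduce_platform(platform, per_suite_counts):
    """Reduce step: a 10th-, 20th- or 30th-floor warden totals all
    the per-suite counts received for a single platform."""
    return (platform, sum(per_suite_counts))
```

Note that the mapper's output keys (platform names) are exactly what the reducer takes as input keys; that hand-off is the heart of the pattern.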
Let's keep going. For the building, the collection of suite numbers, and the smartphone types in each suite, would represent the keys and values in our input file. We split (partitioned) that file into a smaller one for each floor, each of which, just like the original input file, would have suite number as the key and smartphone platform data as the value. For each suite, our mappers (the fire wardens on each floor) created output data with the smartphone platform name as key and the count of handsets for that suite and platform as the value. So the mappers produce output which eventually becomes the Reduce step's input.
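Concretely, one floor's input split and its mapper output might look like the following (the suite numbers and platforms are hypothetical; the point is to see how the key changes from suite number to platform name):

```python
# Input split for one floor: suite number is the key, and the list
# of handset platforms found in that suite is the value.
floor_34_split = [
    ("3401", ["iOS", "Android", "iOS"]),
    ("3402", ["Android", "Windows Phone"]),
]

# After the Map step, platform name is the key and that suite's
# handset count for the platform is the value.
floor_34_mapper_output = [
    ("iOS", 2), ("Android", 1),             # from suite 3401
    ("Android", 1), ("Windows Phone", 1),   # from suite 3402
]
```

Every handset in the input is accounted for exactly once in the mapper output; only the shape of the data has changed.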
But a little work is required in between. The work I did in the lobby, making sure all data for a given key from the mappers' output went to one, and only one, reducer (i.e. one of the fire wardens on the 10th, 20th and 30th floors) as input, made me the master node (as opposed to a worker node, where the mapper and reducer work takes place). As it happens, I sent data for two keys to each reducer node, which is fine. Each reducer then processed its input and generated an output file of its own, with platform as key and the grand total handset count in the building for that platform as the value. My assistant, acting as the output writer, concatenated the reducers' output into a single output file for the job. All of the fire wardens and I acted as the nodes in the cluster.
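The sorting I did in the lobby is what MapReduce calls the shuffle. A rough sketch of it, again in plain Python rather than anything Hadoop actually runs:

```python
from collections import defaultdict

def shuffle(mapper_outputs):
    """Group the mappers' (platform, count) pairs by key, so that all
    data for a given platform lands with one, and only one, reducer."""
    groups = defaultdict(list)
    for platform, count in mapper_outputs:
        groups[platform].append(count)
    return dict(groups)
```

Each reducer then receives one or more of these groups; handing two keys to each reducer, as I did, is just one possible way to partition them.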
Belaboring the analogy even further, imagine that building management didn't want to be outdone by the new World Trade Center building under construction downtown, and so added 50 new floors to the building. My method for getting smartphone platform counts would still work. I'd merely scale out my cluster by enlisting the new floors' fire wardens (i.e. I'd add nodes to the cluster). If, at any point, one of them quit or got fired, I'd merely enlist the help of the new fire warden on that floor, since, in my example, I'd be treating each node as mere commodity hardware. (That would be kind of impersonal of me, but I have work to do.) Best of all, I'd get the answer just as quickly this way, since each fire warden would be counting his or her input data at the same time (and thus processing in parallel).
A Map step is typically written as a Java function that knows how to process the key and value of each input and then emit a new set of key-value pairs as output. The Reduce step is written as another function that does likewise. By writing just two functions in this way, we can solve very complex problems rather simply. And by enlisting more and more nodes in our Hadoop cluster, we can handle more and more data. Hadoop takes care of routing all the data amongst the nodes and calling our Map and Reduce functions at the appropriate times.
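Putting the pieces together, here is a toy, single-process simulation of a whole job, in Python rather than Java. Real Hadoop distributes the map and reduce calls across worker nodes and handles all the routing; everything below is illustrative only, with made-up data:

```python
from collections import defaultdict

def run_mapreduce(input_pairs, map_fn, reduce_fn):
    """Single-process simulation of a MapReduce job: map every input
    pair, shuffle the intermediate pairs by key, then reduce each key."""
    # Map: call the map function on every input key-value pair.
    intermediate = []
    for key, value in input_pairs:
        intermediate.extend(map_fn(key, value))
    # Shuffle: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: call the reduce function once per key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Count handsets by platform across two hypothetical suites.
suites = [
    ("3401", ["iOS", "Android", "iOS"]),
    ("3402", ["Android"]),
]

def count_platforms(suite, handsets):
    for platform in handsets:
        yield (platform, 1)

totals = run_mapreduce(suites, count_platforms, lambda key, values: sum(values))
# totals == {"iOS": 2, "Android": 2}
```

The two problem-specific pieces, `count_platforms` and the summing lambda, are all we had to write; the framework-like `run_mapreduce` plumbing is the part Hadoop provides for us at scale.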
But Map and Reduce functions can be written in many languages, not just Java. Moreover, software can be used to provide abstraction layers over MapReduce, allowing us to give instructions in another manner and rely on the software to generate the MapReduce code for us. To provide just two examples, a Hadoop companion called Hive provides a SQL-like (Structured Query Language) abstraction over MapReduce, and another companion technology, called Pig, does so with its own Pig Latin language.
We'll talk more about Hive, Pig and other products in future posts, as well as about languages that can be used in place of Java to write MapReduce code. We'll look at how analytics and data visualization products use Hive, Pig and other technology to make themselves Big Data-capable. And we'll understand that it all comes down to MapReduce, which isn't so mysterious after all.