The MapReduce 101 story, in 102 stories

Summary: Can a skyscraper completed in 1931 be used to explain a parallel processing algorithm introduced in 2004? In this post, I use the analogy of counting smartphones in the Empire State Building to explain MapReduce...without using code.

A little over a year ago when I started my company, I was able to find a small office in the Empire State Building.  I'm on the 72nd floor facing south, so the view is amazing.  I wish I had better Internet service options though; I've realized it's just not that attractive to service providers to pull their cables to the top of such a tall, old building.  In time, though, I've decided that the building might be more tech-savvy than I realized.  That's because, with only a little contrivance, I believe I can use the building to explain MapReduce, without using code.

One of the things I do in my work is follow market share figures for various smartphone platforms.  I typically rely on the findings of the larger analyst firms to figure out what's what, but I dream of one day getting my own numbers instead.  It struck me recently that if I had a little more pull at the ESB, I could just total up the different smartphone handsets, by platform, in the building.  After all, the building has a good distribution of city and suburban dwellers, different income levels, and a large enough population to have its own 5-digit zip code.

As I continue this data-gathering daydream, I think through how I could go about counting all these cell phones.  I certainly couldn't do it myself.  Even if I had the patience and the speed, the inefficiencies in getting between floors would hurt my performance, as the elevators can be slow, and no employee in the building is happy about people who get on and then off one floor later.

But then I have an idea.  Since every floor has a fire warden whose job it is to count people, maybe I could use those folks as my agents on each floor.  Each floor's fire warden could go into each suite on his or her floor and write down, on a separate piece of paper for each major smartphone platform, the platform name and the total number of handsets for that platform in that suite.  I could tell the fire wardens to create a separate sheet of paper, per suite, for iOS, Android, BlackBerry, Windows Phone, webOS and Symbian, and could also tell them to disregard other phones.  Each fire warden would likely have multiple sheets per platform, of course, since each sheet's count would correspond to a particular suite on the floor.  But that's just fine.
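The story above needs no code, but for readers who would like to see one fire warden's job spelled out anyway, here is a minimal Python sketch.  The function name `count_floor`, the input format (a dictionary mapping suite numbers to the handsets seen there) and the sample floor are all made up for illustration.  The sketch also assumes a warden writes a sheet only for platforms actually found in a suite.

```python
from collections import Counter

# The six platforms named in the post; any other phone is disregarded.
PLATFORMS = {"iOS", "Android", "BlackBerry", "Windows Phone", "webOS", "Symbian"}

def count_floor(suites):
    """Play the fire warden for one floor.

    `suites` maps a suite number to the list of handsets seen there
    (a hypothetical input format). Returns one (platform, count) sheet
    per suite for each platform found in that suite.
    """
    sheets = []
    for suite, handsets in suites.items():
        # Tally this suite only, ignoring phones outside the six platforms.
        tally = Counter(h for h in handsets if h in PLATFORMS)
        for platform, count in sorted(tally.items()):
            sheets.append((platform, count))
    return sheets

# Hypothetical floor with two suites; the feature phone is ignored.
floor_72 = {
    "7201": ["iOS", "iOS", "Android", "Feature phone"],
    "7202": ["Android", "BlackBerry", "iOS"],
}
envelope = count_floor(floor_72)
print(envelope)
```

Notice that the warden never adds up counts across suites.  Each sheet stands alone, which is why the same platform can appear on several sheets in one envelope.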

When the fire wardens were done in all suites, each could put his or her sheets in an envelope and drop it in the mail chute (in the hypothetical case that the chutes were still in use).  I could be waiting in the lobby, and when I knew that all fire wardens had completed their work, I could go around to the mailboxes at each chute and collect the envelopes with the smartphone count sheets.

As a next step, I'd go sit at the security desk, open all the envelopes and sort the sheets, by smartphone platform, into six new piles, putting each pile in its own envelope.  I'd have an intern bring two of the new envelopes up to the 10th floor, another intern bring two more to the 20th, and my third intern bring the last two to the 30th floor.  The fire warden on each of those three floors would open an envelope, total up the counts on the individual sheets, and write down the platform name and that grand total on a new sheet of paper.  He or she would then repeat the process for the other envelope, writing its platform name and handset total on the same sheet of paper as the first.  Each of my three interns would then take these new sheets from the fire wardens up to my office on the 72nd floor, where an assistant would be waiting.  He'd then put the data from all three sheets of paper into a single spreadsheet, with platform names in column A and handset counts in column B.  And with that I'd have my smartphone stats for the building.  With the help of the friendly fire wardens, I'd get my answer pretty quickly too.
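For the curious, the lobby sort and the totalling on the 10th, 20th and 30th floors can be sketched in a few lines of Python.  This is only an illustration under assumptions: the function names `sort_into_piles` and `total_pile` and the three sample envelopes are invented here, and everything runs one step after another rather than on three floors at once.

```python
from collections import defaultdict

def sort_into_piles(envelopes):
    """Security desk step: regroup every sheet into one pile per platform."""
    piles = defaultdict(list)
    for envelope in envelopes:
        for platform, count in envelope:
            piles[platform].append(count)
    return piles

def total_pile(platform, counts):
    """Totalling step: one pile in, one (platform, grand total) row out."""
    return platform, sum(counts)

# Hypothetical envelopes from three floors, each a list of (platform, count) sheets.
envelopes = [
    [("iOS", 2), ("Android", 1), ("iOS", 1)],
    [("Android", 4), ("BlackBerry", 3)],
    [("iOS", 5), ("Windows Phone", 1), ("BlackBerry", 2)],
]

piles = sort_into_piles(envelopes)

# The assistant's spreadsheet: platform in column A, handset total in column B.
spreadsheet = sorted(total_pile(platform, counts) for platform, counts in piles.items())
for platform, total in spreadsheet:
    print(f"{platform}\t{total}")
```

Because each pile is totalled without looking at any other pile, the piles could be handed to different people (or machines) and worked on at the same time, just as the three fire wardens would.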

This example's not perfect, and I might update this post over time to make it more so.  But if you can understand the process I just explained, then you can understand MapReduce.   Just let this stuff sink in for a bit.  In my next post, I'll introduce the vocabulary (jargon?) used in MapReduce-speak to explain what the building employees, suite numbers, smartphone platform names, handset counts, fire wardens, sheets of paper, and the final spreadsheet represent.

Topics: Smartphones, Hardware, iOS, Mobile OS, SMBs

Andrew Brust

About Andrew Brust

Andrew J. Brust has worked in the software industry for 25 years as a developer, consultant, entrepreneur and CTO, specializing in application development, databases and business intelligence technology.

Talkback

6 comments
  • how does map-reduce differ

    Can't the same thing be accomplished with multithreading? Kick off a thread for each floor and use additional threads to combine the results?
    ababiec@...
    • Yes

      You can make your own framework using threads and manage them yourself to do the processing.

      To make the most of it, though, you need to manage the resources manually. If you use something like Hadoop, it can do this management for you and span a large number of servers to process the results.

      Take a look at the overview http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Overview
      x21x
  • RDBMSs already do parallel processing

    It's just another optimization method that usually does its work quietly in the background; programmers generally don't need to know anything about it.

    No reason to change your database architecture (and cause a lot of pointless disruption) to get parallelism.
    jorwell
    • parallel processing is limited

      True, as long as your processing procedures don't take an extreme amount of time on your servers. Also true, as long as your database has a reasonable number of rows per table. Once you get above 8 hours of processing and/or around 1B rows in tables, you're reaching the limits of most RDBMS parallel processing. Big data procedures are not necessary for the majority of database implementations, but where they are, you have to think about them. And if there is a way to speed up even RDBMS processes by using big data procedures, it sure can't hurt.
      brentgee
      • So not very relevant for most businesses

        As the vast majority of businesses don't have 1 billion row tables (unless they are doing some absurd denormalization), big data can be safely ignored by most people.

        The big data optimization techniques will eventually end up (if they prove to be valuable) in the optimization techniques of RDBMSs.
        jorwell
  • mobile phones in ESB

    It doesn't seem like it would be that hard to gather this data. You set up a beacon in the lobby or something and any device that scans for a data channel is going to communicate information about itself in the headers -- or you can send a fake pingback and ask for that information.
    crasshopper@...