MapReduce, streaming beyond Java

Summary: Hadoop Streaming allows developers to use virtually any programming language to create MapReduce jobs, but it’s a bit of a kludge. The MapReduce programming environment needs to be pluggable.


I gave an introductory talk on Hadoop yesterday at the Visual Studio Live! conference in Las Vegas. During the talk, I discussed how Hadoop Streaming, a utility that allows arbitrary executables to be used as Hadoop's mappers and reducers, enables languages other than Java to be used to develop MapReduce jobs. For attendees at the conference, the take-away was that they could use C#, the prominent language of Microsoft's .NET platform, to work with Hadoop. Cool stuff. Or is it?

I likened Hadoop Streaming to the CGI (Common Gateway Interface) facility on Web servers.  In the pioneering days of the Web, CGI was frequently used to create Web applications with any programming language whose code could be compiled to an executable program.  While not the most elegant way to do Web development, CGI opened up that world to developers working with common business programming languages.  CGI still works; in fact, we now have FastCGI and, to this day, the PHP programming language can still work within Web servers’ CGI frameworks.

Likewise, Hadoop Streaming allows developers to use virtually any programming language to create MapReduce jobs. And just like CGI, it's a bit of a kludge. It works, but it ain't pretty. Writing MapReduce code in C# means creating two separate "console applications" (i.e. those that run from what is essentially the modern equivalent of a DOS prompt), with the mapper code in the Main() method of one and the reducer code in that of the other. The two programs talk to Hadoop only through standard input and standard output, so there's no Hadoop context as a developer writes the code, and the two executables have to be compiled, then uploaded to the Hadoop server. Only after all that can you get the job to run. It's impressive, but when you're done with it, you feel like you need to wash your hands.
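
To make the mechanics concrete, here is a minimal sketch of the Streaming contract, written in Python rather than C# purely for brevity. It assumes a word-count job and Streaming's default convention of tab-separated key/value lines; the function names and the "map"/"reduce" command-line argument are my own invention for illustration, not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word read."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each word.

    Streaming delivers reducer input sorted by key, so records that share
    a key arrive consecutively and can be grouped as they stream past.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: run as "script.py map" or "script.py reduce".
# Each role reads standard input and writes standard output, which is
# the only interface Streaming gives the program.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

You can mimic what Hadoop does between the two phases by sorting the mapper's output before handing it to the reducer, for example `list(reduce_lines(sorted(map_lines(["a b", "b c"]))))`. On a real cluster the two roles would be named through the Streaming utility's -mapper and -reducer options. Note what is missing: nothing in the code refers to Hadoop at all, which is both the appeal and the kludge.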

If Hadoop is to go truly mainstream, if it's going to take over the enterprise, if it's going to capture the hearts and minds of today's line-of-business application developers, then it will need to host non-Java execution environments more explicitly and simply. MapReduce needs to be pluggable, just as Web servers are now. Web applications can be easily developed in C# (and other languages), which is why the Web became mainstream as a platform, even inside the firewall. Now the same must happen with MapReduce.

In my discussion with Pervasive Software's Mike Hoskins last week, he told me that such a pluggable framework is on its way. As far as I'm concerned, it can't come soon enough. The market has already proven that Java-only isn't good enough. First-class status for other programming languages, and other programmers, is the way to go.

Topics: Open Source, Software Development

Andrew Brust

About Andrew Brust

Andrew J. Brust has worked in the software industry for 25 years as a developer, consultant, entrepreneur and CTO, specializing in application development, databases and business intelligence technology.

  • It sounds archaic - what has this to do with modern DBMS technology?

    You might want to look at slide 23:

    But really, accessing a DBMS with a low-level, procedural, pointer-based programming language like Java, is this a joke or what?

    And jobs? What is this, a sixties vintage IBM mainframe?
  • Hadoop Streaming

    I've been using Hadoop Streaming extensively now at my job and it's really cool. I agree that it's a powerful platform, and much easier for programmers who want to use their programming language of choice for Hadoop/MapReduce.

    I tried Hadoop, Hive and Pig before settling on streaming - I think it's only a matter of time before there are richer interfaces for Hadoop Streaming.
  • Ah...

    Hey, I hear you. How many times do I see "First install cygwin." when reading about how to learn some cool technologies using a Windows system. Pshaw. Don't these computer scientists know Windows is the best operating system ever, save the next Windows viewable on the horizon, and clearly the first place to put their technology, and not in some environment that can be licensed at no cost? It's almost as though they think their money is better spent on the r&d and not in maintaining the cash flow to Microsoft so it may, nay, shall build the next awesome.

    Perhaps if Microsoft had actively supported .NET on other platforms instead of giving Miguel de Icaza some cooperation, license to most of the API, and a real strong suggestion that Mono developers and users would not be sued in the foreseeable future, then more projects would be C#/CLR based. Well, Microsoft thought they knew what they were doing.

    Anyway, learning the DSL of the tech, and using one's language of choice and related driver to wrap the DSL and communication protocols worked out reasonably well for SQL. No? But if the Hadoop project does extend and wrap the protocols in other native languages, then that's good on them.