MapReduce, streaming beyond Java

Hadoop Streaming allows developers to use virtually any programming language to create MapReduce jobs, but it’s a bit of a kludge. The MapReduce programming environment needs to be pluggable.
Written by Andrew Brust, Contributor

I gave an introductory talk on Hadoop yesterday at the Visual Studio Live! conference in Las Vegas.  During the talk, I discussed how Hadoop Streaming, a utility which allows arbitrary executables to be used as Hadoop’s mappers and reducers, enables languages other than Java to be used to develop MapReduce jobs.  For attendees at the conference, the take-away was that they could use C#, the prominent language of Microsoft’s .NET platform, to work with Hadoop.  Cool stuff.  Or is it?

I likened Hadoop Streaming to the CGI (Common Gateway Interface) facility on Web servers.  In the pioneering days of the Web, CGI was frequently used to create Web applications with any programming language whose code could be compiled to an executable program.  While not the most elegant way to do Web development, CGI opened up that world to developers working with common business programming languages.  CGI still works; in fact, we now have FastCGI and, to this day, the PHP programming language can still work within Web servers’ CGI frameworks.

Likewise, Hadoop Streaming allows developers to use virtually any programming language to create MapReduce jobs.  And just like CGI, it’s a bit of a kludge.  It works, but it ain’t pretty.  Writing MapReduce code in C# means creating two separate “console applications” (i.e., programs that run from what is essentially the modern equivalent of a DOS prompt), with the mapper code in the Main() method of one and the reducer code in that of the other.  Each program simply reads records from standard input and writes its results to standard output.  There’s no Hadoop context as a developer writes the code, and the two executables have to be compiled, then uploaded to the Hadoop server.  Only after all that can you get the job to run.  It’s impressive, but when you’re done with it, you feel like you need to wash your hands.
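To make the shape of a Streaming job concrete, here is a minimal word-count sketch, written in Python rather than C# purely for brevity.  It assumes the usual Streaming convention of one tab-separated key/value record per line, and that the reducer receives its input already sorted by key.  The names map_lines, reduce_lines and STEPS, and the "map"/"reduce" command-line argument, are hypothetical choices for this illustration, not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one "word<TAB>1" record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each word.

    Assumes the records arrive sorted by key, so that all records for
    the same word are adjacent.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention for this sketch: the script is started with a
# single argument, "map" or "reduce", to select which role it plays.
STEPS = {"map": map_lines, "reduce": reduce_lines}

if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in STEPS:
    for record in STEPS[sys.argv[1]](sys.stdin):
        print(record)
```

Saved as, say, wordcount.py (a made-up file name), the same flow can be approximated locally with ordinary pipes: python wordcount.py map < input.txt | sort | python wordcount.py reduce.  Notice that nothing in the code refers to Hadoop at all, which is exactly the "no Hadoop context" problem described above.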

If Hadoop is to go truly mainstream, if it’s going to take over the enterprise, if it’s going to capture the hearts and minds of today’s line-of-business application developers, then it will need to host non-Java execution environments more explicitly and simply.  MapReduce needs to be pluggable, just like Web servers are now.  Web applications can be easily developed in C# (and other languages), which is why the Web became mainstream as a platform, even inside the firewall.  Now the same must happen with MapReduce.

In my discussion with Pervasive Software’s Mike Hoskins last week, he told me that such a pluggable framework is on its way.  As far as I’m concerned, it can’t come soon enough.  The market has already proven that Java-only isn’t good enough.  First-class status for other programming languages, and other programmers, is the way to go.
