Microsoft does Hadoop, and JavaScript's the glue

Summary: Microsoft has a reputation for modifying external technology when adopting it. But in the case of Hadoop, Microsoft is so far staying true to the core technology, providing optional integration with its own stack, and making it easier for people to work with Hadoop and get excited about it.

Microsoft's getting into the Hadoop game, and people are skeptical.  Can Microsoft really embrace open source technology?  And if it can, will it end up co-opting it somehow, or will it truly play nice?  Would you even want to run Hadoop on the Windows operating system?  Why bother?  Why care?

Based on what's in Microsoft's Hadoop distro and its cloud spin on things, I would say that you should care, and quite a lot.  Microsoft is not dumbing Hadoop down.  Instead, it is making it almost trivial to get a Hadoop cluster up and running.  If that's not enough for you, then I think this should be: a browser-based console where you can work with Hadoop using a very friendly programming language.  That language isn't Visual Basic, Microsoft's 20-year stalwart for business application development.  And it's not C#, the favored language for the company's .NET platform.  Actually, it's not a Microsoft language at all.  Rather, it's the language that runs the Web these days: JavaScript.  While the console lacks many of the niceties of modern relational database tooling, it's still very useful and convenient.  And when Microsoft's Hadoop distro becomes generally available (it's still in an invitation-only beta phase right now), I think it may bring many more people into the Hadoop ranks, regardless of their preferred platform and persuasion.

Microsoft's Hadoop distribution, which it is building in partnership with Hortonworks, includes the core HDFS and MapReduce, plus several other components.  Microsoft's also throwing in Hive, Pig, Mahout, Sqoop, Hedwig, Pegasus and HBase.  (The last of these is no small feat for the creator of SQL Server.)  The distribution can be installed on-premises on Windows Server or in the cloud on customers' Windows Azure "roles" (virtual machines).

Perhaps the best option, though, is a browser-based provisioning interface for standing up an entire Hadoop cluster in just a few clicks of the mouse.  Once the cluster is up and running, you can use Microsoft's Remote Desktop software to connect directly to the head node, and then go to a command prompt and hack around with Hadoop and all those components.  But the interactive console offers an even better way.  It's a command line interface that gives you, all in one place, access to:

  • HDFS commands
  • JAR file-based MapReduce jobs
  • JavaScript expression evaluation
  • Pig
  • Hive
  • Basic charting (bar, pie and line graphs)
  • The pièce de résistance: a framework to execute MapReduce jobs written in JavaScript

Having one command line for all of this, and being able to mix and match it, is almost magical.  For instance, you can author MapReduce code in JavaScript and then, from the browser-based console, upload the code, run it, write a Pig expression to extract some data from the results, convert the output file content to a JavaScript array and display it in a bar chart.  That's very empowering: it allows you to get your feet wet with MapReduce, HDFS, Pig and some light data visualization in just five interactive lines of code.  And none of it uses any Microsoft technology, except of course Windows Server, which is cloud-based anyway, and therefore abstracted away.
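To give a feel for what MapReduce code in JavaScript might look like, here is a minimal word-count sketch.  Everything about its shape is an assumption made for illustration: the (key, value, context) signatures, the context.write() call and the runLocally() helper are hypothetical, not the console's documented API.  The helper simply stands in for the cluster so the sketch can run in any JavaScript engine.

```javascript
// Word-count sketch. The (key, value, context) signatures and context.write()
// are assumptions for illustration, not a confirmed API of the console.

// Map: emit (word, 1) for every word found in one line of input text.
function map(key, value, context) {
  const words = value.toLowerCase().split(/[^a-z]+/);
  for (const word of words) {
    if (word !== "") {
      context.write(word, 1);
    }
  }
}

// Reduce: sum the counts that were emitted for a single word.
function reduce(key, values, context) {
  let sum = 0;
  for (const v of values) {
    sum += v;
  }
  context.write(key, sum);
}

// Hypothetical local stand-in for the cluster: runs map over each line,
// groups the emitted pairs by key (the "shuffle"), then runs reduce per key.
function runLocally(lines) {
  const groups = new Map();
  const mapContext = {
    write(k, v) {
      if (!groups.has(k)) groups.set(k, []);
      groups.get(k).push(v);
    },
  };
  lines.forEach((line, offset) => map(offset, line, mapContext));

  const results = [];
  const reduceContext = {
    write(k, v) {
      results.push({ word: k, count: v });
    },
  };
  for (const [k, vs] of groups) {
    reduce(k, vs, reduceContext);
  }
  // Sort by descending count (ties broken alphabetically): a plain array
  // of this shape is the kind of thing a bar chart could consume.
  results.sort((a, b) => b.count - a.count || a.word.localeCompare(b.word));
  return results;
}

const topWords = runLocally([
  "Hadoop on Windows",
  "Hadoop in the browser",
]);
console.log(topWords);
```

On a real cluster the framework, not a helper like runLocally(), would feed lines to map and grouped values to reduce; only the two small functions are the part you would actually write.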

There's more, too, like an ODBC driver for Hive that effectively attaches Excel and most of the Microsoft Business Intelligence stack to Hadoop.  But that's fodder for a separate post...or seven.

Microsoft's Hadoop offering should become generally available before too long.  But if you'd like to apply for an invite to the beta, create an account on "Connect" and then fill out the special survey.

Topics: Microsoft, Open Source, Software Development

Talkback

3 comments
  • Hadoop on Windows

    We get all the niceties on top of Hadoop from Microsoft + the untraceable crashes and hangs. Avkash has been actively blogging (http://blogs.msdn.com/b/avkashchauhan/) about Hadoop on Windows; the blog has many hands-on tutorials and how-tos. BTW, I also heard that Microsoft is contributing to Apache, though I'm not sure how much or what.

    Hadoop on Linux has been in production for some time, and some of the bugs @ scale would have been ironed out by now. Unless Hadoop on Windows is adopted @ scale, getting to the same point would be difficult.

    > Perhaps the best option, though, is a Web browser-provisioning interface for standing up an entire Hadoop cluster in just a few clicks of the mouse.

    Cloudera provides something similar too.

    > There's more too. Like an ODBC driver for Hive that effectively attaches Excel and most of the Microsoft Business Intelligence stack to Hadoop.

    Datameer has been doing Excel <-> Hadoop for some time.

    Hope my comment sticks this time :)
    praveensripati@...
    • What???

      You don't like untraceable crashes and hangs???

      It's all part of the experience!
      harvey_rabbit
  • @harvey_rabbit - We know Copy Paste works

    We saw your comment on top.
    arebangdu