What are some of the cool things in the 2.0 release of Hadoop? To start, how about a revamped MapReduce? And what would you think of a high availability (HA) implementation of the Hadoop Distributed File System (HDFS)? News of these new features has been out there for a while, but most people are still coming up the basic Hadoop learning curve, so not a lot of people understand the 2.0 stuff terribly well. I think a look at the new release is in order.
Honestly, I didn't really grasp HA-HDFS or the new MapReduce stuff myself until recently. That changed on Thursday, when I had the pleasure of speaking to a real Hadoop pro: Todd Lipcon of Cloudera. Todd is one of the people working on HA-HDFS and, in addition to his engineering knowledge, he possesses the rare skill of explaining complicated things rather straightforwardly. Consider him my ghostwriter for this post.
What's the story, Jerry?
To me, Hadoop 2.0 is an Enterprise maturity turning point for Hadoop, and thus for Big Data:
HDFS loses its single-point-of-failure through a secondary standby "name node." This is good.
Hadoop can process data without using MapReduce as its algorithm. This is potentially game-changing.
If you wanted an executive summary of Hadoop 2.0, you're done. If you'd like to understand a bit better how it works, read on.
The big changes in Hadoop 2.0 come about from a combination of a modularization of the engine and some common sense:
Modularization: The "cluster scheduler" part of Hadoop MapReduce (the thing that schedules the map and reduce tasks) has become its own component
MapReduce is now just a computation layer on top of the cluster scheduler, and it can be swapped out
Common Sense: HDFS, which has a "name node" and a bunch of "worker nodes," now features a standby copy of the name node
The standby name node can take over in case of failure of, or planned maintenance to, the active one. (Lipcon told me the maintenance scenario is the far more common use case and pain point for Hadoop users)
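To make the standby concrete: with HA-HDFS you define a logical "nameservice" and list the two name nodes under it in hdfs-site.xml, and clients address the nameservice rather than a single machine. Here's a minimal sketch of that configuration; the nameservice name "mycluster", the node IDs, and the hostnames are all placeholder values (a real setup also needs shared-edits and failover settings covered in the Apache docs):

```xml
<!-- hdfs-site.xml: minimal HA sketch; "mycluster", nn1/nn2, and the
     hostnames are placeholders, not values from a real cluster -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
```

The point of the indirection is exactly the maintenance scenario Lipcon described: clients keep talking to "mycluster" while the active role moves between nn1 and nn2.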
The scale out infrastructure of Hadoop can now be used without the MapReduce computational approach. That's good, because MapReduce solutions don't always work well, even for legitimate Big Data solutions. And in my conversation with Pervasive's CTO Mike Hoskins, I learned that even if you subscribe to the MapReduce paradigm, Hadoop's particular implementation can be improved upon:
With this new setup, I can see a whole ecosystem of Hadoop plug-in algorithms evolving. Certain Data Mining engines have used that to particular advantage before, and the Hadoop community is very large, so this has major potential. We may also see optimized algorithms evolve for specific hardware implementations of Hadoop. That could be good (in terms of the performance wins) or bad (in terms of fragmentation). But either way, for Enterprise software, it's probably a requirement.
And with new algorithms comes the ability to use languages other than Java, making Hadoop more interesting to more developers. In my last post, I explained why Hadoop's current alternate-language mechanism, its "Streaming" interface, is less than optimal.
Pluggable algorithms in MapReduce will transcend these problems with Streaming. They'll make Hadoop more of a platform, and less of an environment. That's the only way to go in the software world these days.
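For readers who haven't used Streaming: it works by piping lines of text through a script's stdin and stdout, which is flexible but means every record pays a text serialization toll. A minimal word-count sketch in that style, with the map-sort-reduce pipeline simulated locally (on a real cluster you'd hand scripts like these to the hadoop-streaming jar):

```python
# Word count in the Hadoop Streaming style: mapper and reducer each
# read lines and emit tab-separated key/value text. This is a local
# simulation, not a real Streaming job.
from itertools import groupby

def mapper(lines):
    # Emit "word<TAB>1" for every word seen.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(pairs):
    # Streaming delivers keys to the reducer in sorted order, so
    # consecutive identical keys can be summed with groupby.
    split_pairs = (p.split("\t") for p in pairs)
    for word, group in groupby(split_pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

if __name__ == "__main__":
    # Simulate the map -> sort -> reduce pipeline in-process.
    text = ["the quick brown fox", "the lazy dog"]
    for out in reducer(sorted(mapper(text))):
        print(out)  # e.g. "the\t2" since "the" appears twice
```

Every key and value makes that round trip through text, which is part of why pluggable, in-process computation layers are the more attractive path.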
The High Availability implementation of HDFS makes Hadoop more useful in mission-critical workloads. Honestly, this may be more about perception than substance. But who cares, really? If it enhances corporate adoption and confidence in Hadoop, it helps the platform. Technology breakthroughs aren't always capability-based.
A note on versions
When can people start using this stuff? Right now, technically; but more like "soon" in practice. To help clear this up, let's get some Hadoop version and nomenclature voodoo straight.
Hadoop pros like to speak of the "0.20 branch" of the product's source code and the "0.23 branch" too. But each of those pre-release-sounding version numbers got a makeover, to v1.0 and 2.0 respectively. The pros working closely with this stuff may still use the decimal designations even now, so keep in mind that discussions of 0.23 really reference 2.0.
I should also point out that version 4 of Cloudera's CDH ("Cloudera's Distribution Including Apache Hadoop"), which is based on Hadoop 2.0, is in beta. As such, I think of Hadoop 2.0 as beta software, even if you can pull the code down from Apache and easily put it in production. It's still more of an early adopter release, so proceed with caution.
MapReduce in Hadoop 2.0, by the way, is sometimes referred to as MapReduce 2.0, other times as MRv2 and, in still different circles, as YARN (Yet Another Resource Negotiator).
Cloudera has blog posts on both HA-HDFS and MRv2 that cover the features in almost whitepaper-level technical detail. Beyond technical explication, Cloudera's being pretty prudent in their implementation: CDH4 actually contains both the 1.0 and 2.0 implementations of Hadoop MapReduce (you pick one or the other when you set up your cluster). This will allow older MapReduce jobs to run on the new distro, with HA-HDFS. In effect, Cloudera is shipping MRv1 as a deprecated, yet supported, feature in CDH 4. That's a good call.
Hadoop is growing up, and it's not backing off. Learn it, get good at it and use it. If you want, you can download a working VMware image of CDH3 or component tarballs of CDH4 right now.