We knew this day would come. As I noted last week, Apache Hadoop 2.0 was released to general availability, and now top Hadoop vendor Hortonworks has responded in kind with the 2.0 version of its own Hortonworks Data Platform (HDP) distribution. Given that Hortonworks architect and co-founder Arun Murthy was at the helm of the Apache Hadoop release, a prompt update from Hortonworks was absolutely expected.
Knitting some YARN
The hallmark of the 2.0 releases of Apache Hadoop and HDP is the inclusion of YARN -- an acronym for Yet Another Resource Negotiator -- which factors out the management components of Hadoop's MapReduce engine from the MapReduce processing algorithm itself. In other words, while Hadoop 2.0 can use MapReduce to process data, it is now just one of potentially many algorithms that can plug into the engine.
If you're looking for a quick and dirty definition for MapReduce, it's an algorithm that preprocesses data into key and value pairs in a "Map" step, then aggregates or consolidates that data in a "Reduce" step, and does this in parallel across multiple nodes in a compute cluster.
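For readers who prefer code to prose, here is a minimal sketch of that Map-then-Reduce pattern as a word count in plain Python. It is an illustration only: it runs in a single process rather than across a cluster, it does not use any Hadoop API, and the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) are made up for this example.

```python
from collections import defaultdict


def map_step(line):
    """Map: emit a (key, value) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for what a framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Reduce: consolidate all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster, the map and reduce calls would be spread across nodes;
    # here they simply run one after another in a single process.
    pairs = [pair for line in lines for pair in map_step(line)]
    return dict(reduce_step(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Because each line can be mapped independently, and each key can be reduced independently, the work divides naturally across many machines, which is what makes the pattern a good fit for a compute cluster.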
Stinger, take II
Along with YARN comes further development to a Hortonworks-led project called Stinger. Stinger aims to make Apache Hive 100x faster than it had been before the project started. Hive is the Apache engine that essentially converts SQL queries to MapReduce jobs, thus allowing common reporting and BI tools to query Hadoop, albeit somewhat slowly.
Phase 1 of Stinger (which includes SQL compatibility enhancements, column store technology, compression and in-memory hash joins) had already been implemented. With the release of HDP 2.0 and Hadoop 2.0, Hive 0.12 and Stinger Phase 2 -- where Hive continues to use MapReduce but nonetheless benefits from YARN running underneath it -- are now being delivered. Hortonworks told me that Stinger Phase 2 now delivers 60x-70x performance improvements over the pre-Stinger releases of Hive, due in large part to straight Hive improvements, including vector-based queries and optimizations for so-called star joins that are common in data warehouse-type query scenarios.
Stinger Phase 3 will run on an engine called Tez (the Hindi word for "speed," pronounced "taze"), which will swap MapReduce out entirely. And given that Stinger Phase 2 has achieved the 60x-70x mark already, Hortonworks told me they felt rather confident that Stinger Phase 3 may well exceed the 100x goal originally set out for the project. It should be noted that any Hadoop distro that includes Hadoop 2.0 and Hive 0.12 will contain the Stinger Phase 2 improvements, as the code is not proprietary to Hortonworks, but part of the Open Source Hive project.
What's in the box
HDP 2.0 includes updates across the various Hadoop stack components, including HBase, Pig and, as already mentioned, Hive. The following figure details the various versions of Apache project releases included with various releases of HDP:
Products from a number of partners, including MicroStrategy, Tableau, Splunk, WANdisco, Talend and Elasticsearch, are already HDP 2.0-certified.
The HDP Sandbox, a pre-built virtual machine image containing the full HDP installation, will be updated for HDP 2.0 shortly and will include tutorials not only from Hortonworks but also from third parties including Microsoft, Talend and Tableau.
Hadoop on Windows
Speaking of Microsoft, the Windows flavor of HDP 2.0 will drop in mid-November, according to Hortonworks. The reason for the short lag rests mostly on some finishing touches around integration between Apache Ambari and Microsoft System Center. The Windows version of the HDP Sandbox, available in VMware, VirtualBox and Hyper-V virtual machine formats, will be updated as well.
The HDP 2.0 release is big news...and with Strata/Hadoop World coming up next week in New York City, I expect there will be a lot more very soon.