Just in time for the O'Reilly Strata conference, three companies in the tech world have announced new distributions of Apache Hadoop, the open-source, MapReduce-based, distributed data analysis and processing engine. Hortonworks Data Platform (HDP) 1.1 for Windows, EMC/Greenplum Pivotal HD, and the Intel Distribution for Apache Hadoop have all premiered this week on the big data stage.
Join the club
The big data world has been host to several Hadoop distributions. Cloudera and Hortonworks are prominent, with both companies claiming pedigrees from the original Hadoop team. MapR is there too, especially in the cloud, given its alliances with Amazon Web Services and Rackspace. IBM has its own distro as well, dubbed InfoSphere BigInsights, which includes a little magic dust to integrate with Netezza and DB2.
Hadapt has its own distro, mashed up with a Massively Parallel Processing (MPP) data warehouse. Microsoft has its HDInsight Service (on its Windows Azure cloud platform) still in preview, as well as plans to introduce a Windows Server flavor of the product.
That's a lot of Hadoop, especially considering that the core Apache code can be used as well. So why would new distributions emerge?
To each, his own
If we deconstruct these announcements a bit, we can see that the three companies are customizing Hadoop in special ways, to further their own interest in big data.
Intel has enhanced the core Hadoop Distributed File System (HDFS), the YARN ("yet another resource negotiator")/MapReduce v2 engine, the SQL-like query layer Hive, and NoSQL store HBase to take advantage of Intel processor, solid-state drive (SSD) storage, encryption, and 10Gb Ethernet technology. These enhancements have been contributed back to the Apache Hadoop project. Intel is also offering the proprietary Intel Manager for Apache Hadoop, to handle deployment, configuration, monitoring, alerts, and security.
Using technology that EMC/Greenplum calls "HAWQ", the company is integrating its MPP product with Hadoop, much as Hadapt, Teradata Aster, and ParAccel have done, and Microsoft soon will, with the PolyBase component of its SQL Server Parallel Data Warehouse product. Cloudera's Impala product fits in this category as well, though Cloudera is a Hadoop vendor implementing new MPP technology, the exact opposite of EMC/Greenplum's approach.
Both EMC/Greenplum and Intel are already building out their ecosystems: Cirro has announced support for Pivotal HD, while SAP and MarkLogic have announced support for the Intel Distribution for Apache Hadoop.
The Hortonworks announcement is a bit harder to interpret, especially because Microsoft and Hortonworks are partners, such that Microsoft's HDInsight is already based on the HDP Windows code base. With that being the case, Hortonworks' new distro might seem superfluous, at first blush.
To decode the Hortonworks news, I spoke with two important people on the scene: Shaun Connolly, VP of Corporate Strategy at Hortonworks, and Herain Oberoi, director of product marketing and SQL Server product management at Microsoft. Both gentlemen provided me with rational explanations that were, thankfully, in agreement with each other.
Microsoft HDInsight Server, when released, will integrate with Microsoft technologies like System Center and Active Directory. For Microsoft shops, this is crucial and will help provide a smooth path to Hadoop. HDP for Windows, meanwhile, will be a more straightforward Hadoop distro that just so happens to run on Windows, rather than Linux. For Hadoop shops, it will provide a smooth path to the large chunk of the x86 server population running Windows (70 percent of the market, according to Hortonworks' Connolly), as Hortonworks will maintain consistency of HDP across Linux and Windows.
Oberoi told me that HDInsight will be a superset of HDP. Connolly invoked a "Russian doll" metaphor to convey the same policy. Both parties told me that code developed against one distro should port seamlessly to the other. Connolly even told me that code written using Microsoft's .NET software development kit (SDK) for Hadoop should work against HDP for Windows. Connolly also told me that HDP for Windows will include Microsoft's own Open Database Connectivity (ODBC) driver for Hive, rather than the Simba Technologies-provided driver that Hortonworks ships with HDP for Linux.
When I asked if new versions of HDInsight and HDP for Windows would ship in tandem, so as to maintain this compatibility and superset dependency, the responses I received were less definitive. My hope would be that the two companies stick with such a plan — and that the alliance doesn't go the way of Microsoft's erstwhile relationship with Sybase that gave birth to the SQL Server relational database. Meanwhile, given that all of the HDP code is open source, I suppose a parting of ways would be less impactful in the Hortonworks case.
Regardless of future outcomes, there is an immediate upside. While an invitation-based preview of the (cloud) HDInsight service has been ongoing for some time, no bits are available yet for the (on-premises) HDInsight server. Hortonworks' HDP for Windows, therefore, will finally allow Windows shops to set up multinode Hadoop clusters on Windows Server 2012 or 2008 R2.
The burden of choice
With so many Hadoop distros out there, the question of fragmentation is hard to avoid. In many ways, the Hadoop world is starting to reflect the Unix scene of the 1980s and the Linux landscape of the last decade-plus. Ironically, Hadoop is being so universally adopted that it's not especially consistent from one vendor environment to another.
Another take on this, however, is that the greater Hadoop's adoption, the more infrastructural, and less exposed, it becomes. It's a bit like TCP/IP, the now-standard network protocol used throughout the industry. Every operating system supports it, and all of them integrate it tightly in their platforms. So it's customized, but it's also interoperable. Perhaps Hadoop is destined for similar embrace, extension, and embedment.