As Hortonworks' Hadoop Summit event kicks off in Dublin today, the Hadoop distribution vendor has a full slate of announcements. The announcements themselves are substantial and impressive, and I'll cover each of them here.
As you read through them, however, keep in mind that they at once highlight and reinforce the idea that the "retail" Hadoop world is becoming split in two -- as Hortonworks and Cloudera each introduce unique components in their distros that often meet corresponding needs and requirements.
Announcements, please
First off, a bit of a bombshell. Pivotal, which entered the Hadoop distribution race over three years ago with the introduction of Pivotal HD, will now be reselling Hortonworks Data Platform (HDP), which is Hortonworks' Hadoop distribution.
Little by little, Pivotal has been winding back its Hadoop ambitions: first, it announced that Pivotal HD would conform to the Open Data Platform initiative's (ODPi's) specs; next, it outsourced Hadoop support to Hortonworks; then it open sourced all of its proprietary data components, including its HAWQ SQL-on-Hadoop engine, and even the Greenplum MPP data warehouse product from which HAWQ was derived.
Now Pivotal is sunsetting Pivotal HD and transitioning to what it will brand Pivotal HDP (which, Hortonworks' press release says, "is 100% identical to the Hortonworks Data Platform"). Meanwhile, HDB, Pivotal's distribution component based on what is now Apache HAWQ, will make its way to the Hortonworks camp as Hortonworks HDB, which will be available as an add-on subscription to the core HDP. HDGotAllThat?
From HDP to DMX
Hortonworks also announced that it will be doing some reselling of its own. Specifically, the company will be reselling Syncsort's DMX-h product, which integrates mainframe-based and other legacy ETL processing with Hadoop.
I don't have too much analysis on this one. Syncsort, recently acquired by private equity firm Clearlake Capital, is a good company with interesting -- if a bit niche -- technology. As Hortonworks looks to fortify its enterprise prowess, being able to offer its customers Syncsort's technology directly seems like common sense.
Shh! The tech previews are starting
Next, Hortonworks is announcing that it has integrated Apache Atlas (incubating), which comes out of Hortonworks' Data Governance Initiative, and Apache Ranger (incubating), which is built on technology that came to Hortonworks when it bought XA Secure, almost two years ago. Atlas provides data governance features. Ranger allows for role-based access control across various components of the Hadoop stack.
The integration, which is being launched as a tech preview, allows for the creation of what Hortonworks calls tag-based security policy, whereby customers can, according to Hortonworks' press release, "use Atlas to classify and assign metadata tags, which are then enforced through Ranger to enable various access policies."
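To make the idea of tag-based policy concrete, here is a minimal conceptual sketch in Python. It does not use the Apache Atlas or Apache Ranger APIs: `TagCatalog`, `TagPolicyEngine`, the asset names and the roles are all hypothetical stand-ins, and the rule that untagged assets are unrestricted is an assumption made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical in-memory stand-ins; NOT the Apache Atlas or Apache Ranger APIs.


@dataclass
class TagCatalog:
    """Plays the classification role: maps data assets to metadata tags."""
    tags_by_asset: Dict[str, Set[str]] = field(default_factory=dict)

    def tag(self, asset: str, *tags: str) -> None:
        self.tags_by_asset.setdefault(asset, set()).update(tags)

    def tags_for(self, asset: str) -> Set[str]:
        return self.tags_by_asset.get(asset, set())


@dataclass
class TagPolicyEngine:
    """Plays the enforcement role: policies target tags, not individual assets."""
    catalog: TagCatalog
    # tag -> roles allowed to read anything carrying that tag
    allowed_roles_by_tag: Dict[str, Set[str]] = field(default_factory=dict)

    def allow(self, tag: str, *roles: str) -> None:
        self.allowed_roles_by_tag.setdefault(tag, set()).update(roles)

    def can_read(self, role: str, asset: str) -> bool:
        tags = self.catalog.tags_for(asset)
        # Assumption for this sketch: untagged assets are unrestricted, and a
        # tagged asset is readable only if the role is allowed for every tag.
        return all(role in self.allowed_roles_by_tag.get(t, set()) for t in tags)


catalog = TagCatalog()
catalog.tag("sales.customers.ssn", "PII")          # hypothetical asset name

engine = TagPolicyEngine(catalog)
engine.allow("PII", "compliance_officer")          # hypothetical role

analyst_ok = engine.can_read("analyst", "sales.customers.ssn")
officer_ok = engine.can_read("compliance_officer", "sales.customers.ssn")
untagged_ok = engine.can_read("analyst", "sales.orders.total")
```

The appeal of this style is that a policy is written once per tag: any asset later classified as `PII` is covered without anyone having to write a new per-table or per-column rule.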
Also launching as a tech preview -- the final one, Hortonworks says -- is a new release of Apache Zeppelin (incubating), the developer's "notebook" project for doing data science work with Apache Spark.
New Ambari, Cloudbreak and Metron, oh my!
Not everything is in tech preview though. Hortonworks is bringing release 2.2 of Apache Ambari into general availability (GA). Ambari is a management console for Hadoop clusters, and this version adds visual dashboarding features to help ease the management burden.
Also hitting GA is version 4.2 of Cloudbreak, a product Hortonworks added to its portfolio when it announced its acquisition of SequenceIQ, one year ago tomorrow. Cloudbreak is a tool for easing and automating Hadoop cluster cloud deployments. It has support for "recipes" (manifests that allow the automated specification of Hadoop clusters and their nodes), auto-scaling, deployment to OpenStack-based clouds, and integration with Microsoft's Azure Blob storage, the storage layer that Microsoft's Hadoop distribution, HDInsight, uses by default.
Hortonworks' last announcement concerns cyber security and threat detection technology, and a brand new open source project to deal with it: Apache Metron (incubating). Hortonworks says that Metron works at the application, system and packet level, and also reads feeds from tools, looking at all of them to find anomalies that may indicate an attack in the works.
Tale of two distros
With Pivotal throwing in the towel on Pivotal HD, the number of major Hadoop distributions is down to four: those from Hortonworks, Cloudera, MapR and IBM. Microsoft's HDInsight is already based on HDP, and IBM's distribution is now ODPi-compliant, which brings it pretty close to HDP as well. MapR is certainly distinct, but the company seems to work very closely with specific customers to ensure their success with a number of proprietary technologies like the HDFS-compatible MapR File System, HBase-compatible MapR-DB and Kafka-compatible MapR Streams. MapR's is not an "off-the-rack" Hadoop distribution.
So in what we might call the "retail" Hadoop space, it's really coming down to Hortonworks and Cloudera. And at the same time that this is happening, the two companies are causing their distributions to diverge.
Anything you can do, I can do...different
The combination of Hue and Cloudera Manager serves much the same purpose as Hortonworks-backed Apache Ambari. Cloudera Director and Cloudbreak have much in common. Cloudera-backed Apache Sentry goes head-to-head with Hortonworks-backed Apache Ranger. Cloudera Navigator has an opposite number in Hortonworks' favored Apache Atlas.
Getting the picture? It even extends to SQL-on-Hadoop, as Hortonworks worked hard to get Apache Hive running on Apache Tez, supports Spark SQL and will now be selling subscriptions to HDB, based on HAWQ. Cloudera, meanwhile, has its Apache Impala (incubating) technology and has recently announced version 5.7 of its Cloudera Enterprise distribution, which supports Hive running on...Spark.
Duopoly
The retail Hadoop world is getting a bit like the US political system: two main parties, who are becoming ever more polarized, along with a few surrounding secondary parties. I don't know which company is the Republicans' counterpart, and which is the Democrats', but given the rather dysfunctional state of things around the U.S. presidential election, neither one is exactly an accolade.
Maybe I shouldn't use such a topically political analogy. Maybe the houses of Montague and Capulet would be the more appropriate comparison? Or, given that I am a life-long New Yorker, maybe it's the Jets and the Sharks. (I wonder if Tom Reilly, Mike Olson and Rob Bearden can sing and dance while they rumble...but I digress.)
A fine mess they've gotten us into
How did we get here? The open source and commodity hardware roots of Hadoop (and Spark) were supposed to make code and customers portable across distributions. No lock-in at any layer of the stack, right?
Look, I'm all for innovation in the white space at the top of the stack. But, increasingly, we've got dueling products -- even open source projects -- that do the same things. The result is a bifurcation in the Hadoop space. It wasn't supposed to be that way.