Hadoop Summit San Jose has come to an end. This year, I was there to cover the news, and to present a breakout session. My talk focused on fragmentation in the industry: the Big Data ecosystem has too many vendors, too many Hadoop distributions, too many execution engines, too many Apache projects. The result? An overly complex market for products and technologies that makes it really difficult for customers to make purchasing decisions.
And maybe getting my talk ready biased the way I analyzed the news but, interestingly, it sure seemed like the issues I addressed in my session were on the minds of some of the news-making exhibitors at the conference.
Please release me; let me go
A great example of this is how Hadoop distribution vendors are addressing release cycles. Think about it: vendors include releases from a number of open source projects in their own releases, and so find themselves downstream of numerous rapid release cycles. The vendors need to avoid making their customers wait too long to get their hands on the new features in those releases, but need to do it in a way that doesn't force customers to adopt major new distro upgrades too frequently.
On Tuesday, I covered MapR's approach: to release new open source bits in MapR Ecosystem Packs (MEPs) that are optional, and released on a more frequent cadence than full-fledged MapR releases. As it turns out, Hortonworks announced it will be doing something similar: updating the core of the distribution (YARN, MapReduce and HDFS) as frequent dot releases, and updating the rest of the distribution with less frequent full version releases.
That distro core is aligned with the Open Data Platform initiative (ODPi) Runtime Specification. That makes sense, since Hortonworks is a founding ODPi member. And at the show, ODPi announced that other members have come on board to align their own distributions in the same way. They include Altiscale, a Hadoop-as-a-Service provider with its own Hadoop distribution; Arenadata, which offers a somewhat obscure Russian Hadoop distribution; and India-based global systems integrator Infosys, which has one more. As a reminder, IBM BigInsights was already on board.
But as good as this news is for unifying distributions, MapR and Cloudera are still not members of ODPi; in fact, they continue to be philosophically opposed to its existence. And an upcoming project, the ODPi Operations Specification, could make things worse. This new spec will be premised on Apache Ambari, a component that Hortonworks uses in its distribution. Cloudera, meanwhile, has its own Cloudera Manager component, and MapR announced its own solution, the Spyglass Initiative, on Tuesday.
So, for now, ODPi looks to be a standard aligning Hortonworks Data Platform (HDP) with a collection of distant second-place distributions. There is hope, though. John Mertic, of the Linux Foundation, who is ODPi Director of Program Management, explained to me that the Runtime Spec may be broadened to include parameters for a Hadoop Compatible File System (HCFS) in place of standard HDFS. This would ostensibly allow MapR and Microsoft (which replace HDFS with MapR-FS and Azure blob storage, respectively) to adopt the ODPi Runtime Spec. And since the more divisive Operations Spec would be optional, maybe this could break the logjam.
Hortonworks also announced a preview release of an engine the company is calling LLAP ("live long and process") -- which, along with Tez, enables sub-second query times on Hive. Hortonworks also said it's adding a new component to HDP: Apache Phoenix, designed as a relational database for OLTP and operational analytics applications, and built on top of Apache HBase.
While this clearly brings innovation to HDP customers, it also makes the SQL-on-Hadoop landscape even more untidy: Phoenix is a brand new SQL-on-Hadoop option, and LLAP now introduces a fourth operation mode for Hive (the other three being MapReduce, Tez-only and Spark).
Governance, security; unity?
The folks at Dataguise, at least, are focusing on unification, of a sort. The company's DgSecure product brings AI-driven data governance and security (using machine learning to suggest and automate identifying metadata and imposing restrictions on certain rows and columns of data) to all Hadoop platforms, and integrates it with Apache Ranger and Atlas.
At Hadoop Summit, Dataguise announced version 6.0 of the DgSecure product. Here's an example of an ISV adding value and avoiding fragmentation by building compatible technology. Will Dataguise add support for Apache Sentry and Cloudera Navigator? I hope so, as it would set a great example for the rest of the Big Data ecosystem.