ODPi runtime spec aims to defrag Hadoop

Can a vendor consortium help standardize Hadoop's core, or will distro differences continue to grow like weeds?

The Open Data Platform initiative (ODPi) has announced its first Runtime Specification, and an associated test suite, for Hadoop. The ODPi Runtime aims to be a universal spec for the core components of a Hadoop distribution, with the goal of standardizing them, reducing fragmentation and maximizing compatibility. It makes its debut today after more than a year of fanfare.

Derived from Apache Hadoop 2.7, the runtime specification features HDFS, YARN, and MapReduce components and is part of the common reference platform ODPi Core.

Initial controversy
When it started, the ODPi was somewhat hampered by a conspiracy theory -- that the organization was a vehicle for Hortonworks (a dominant founding member, along with Pivotal) to standardize Hadoop around its own distro, the Hortonworks Data Platform (HDP). ODP was in fact based on the core of HDP (and only one letter off from it), and even included Apache Ambari which, although an Apache open source project, was and is nonetheless Hortonworks' technology, and not used by Cloudera or MapR.

Perhaps predictably, some dissent ensued. Mike Olson of Cloudera blogged about the company's opposition to the ODPi (then known as ODP), arguing that it displayed disregard for the Apache Software Foundation's governance of the Hadoop project. MapR was similarly unenthused. IBM and SAS, meanwhile, joined the consortium, as did Altiscale and various other firms, including Capgemini.

Evolution
But some positive changes came about, including branding (ODP is now known as ODPi), governance (the project moved under the umbrella of the Linux Foundation) and leadership (Hortonworks took more of a backseat role, and several other companies signed on -- there are now more than 25 members in total).

And now that the Runtime spec is released, we find out that ODPi wisely decided to define Ambari as non-core -- including it instead in a complementary "Operations Specification." That's a smart, consensus-building move. Arguably, that separation should have been part of the ODPi's initial foray -- but at least it's there now.

Why it matters
Hadoop distros have numerous components, each with a long release history, making for a huge number of possible version combinations. That means added expense and long testing cycles for ISVs looking to guarantee compatibility. And if it's hard for the ISVs that specialize in the Hadoop space, imagine how the poor customers feel.
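To put a rough number on that combinatorial problem, here is a minimal back-of-the-envelope sketch. The component names and version counts are hypothetical placeholders, not real release data, and the sketch assumes the worst case in which any version of one component might be paired with any version of another.

```python
from math import prod

# Hypothetical number of release lines an ISV might encounter per component.
# These names and counts are illustrative assumptions, not real release data.
supported_versions = {
    "hadoop_core": 4,
    "component_a": 5,
    "component_b": 3,
    "component_c": 6,
}


def count_combinations(versions_per_component):
    """Count distinct stacks if every component version can pair with any other."""
    return prod(versions_per_component.values())


# Worst case: every combination is a separate test target.
all_stacks = count_combinations(supported_versions)

# If a shared spec pinned the core to a single version, that factor drops to 1.
pinned_core = {**supported_versions, "hadoop_core": 1}
pinned_stacks = count_combinations(pinned_core)

print(all_stacks, pinned_stacks)
```

Under these made-up numbers, pinning just one component cuts the test matrix by a factor of four. Real distros ship curated bundles rather than every possible pairing, so treat this as an upper bound that shows how quickly the matrix grows, not as a measurement.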

Decades ago, UNIX was a popular operating system, in contention with DOS and Windows for dominance in what we've come to call the enterprise. But the number of UNIX variants was vast, and the resulting customer confusion was damaging to the OS. While the Hadoop ecosystem is different, its participants need to be vigilant in avoiding a similar fate.

I'm not prepared to bet money that Cloudera, MapR and Amazon will consider making their distros ODPi-compliant. But I am willing to bet that if they did, the big data industry would be helped.