Building on the apparent success of its Cloud Foundry PaaS project, EMC and VMware spinoff Pivotal today unveiled backers for a new initiative, aimed at defining a core set of Apache technologies to speed adoption of Hadoop.
The Open Data Platform's founding members - GE, Hortonworks, IBM, Infosys, Pivotal and SAS, along with AltiScale, Capgemini, CenturyLink, EMC, Teradata, Splunk, Verizon and VMware - will test and certify a number of primary Apache components, which will then form the basis of their Hadoop platforms.
"This is an extremely important announcement for us. It's the equivalent of the Cloud Foundry Foundation step for platform as a service," Pivotal senior director of outbound product Michael Cucchi said. "This is the same calibre of event for big data and big data analytics - and obviously specifically to rapidly accelerate Hadoop."
In January, Pivotal said it had taken $40m in Cloud Foundry sales in the last three quarters of 2014. The open-source Cloud Foundry Foundation, which launched a year ago with seven participating organisations, now has more than 45 members. This week the company said it had secured $100m in big-data software bookings in 2014.
Shaun Connolly, VP of corporate strategy at Open Data Platform platinum member Hortonworks, said his company co-founded the initiative with Pivotal to provide a well-defined platform for the Hadoop ecosystem and help minimise fragmentation and duplication of effort.
"It's a strong rallying cry for the market around a common core that the industry can count on. It has enough participation across the various perspectives to make sure it reflects the needs of not only vendor agendas but end users," Connolly said.
"It's very well aligned with the Apache Software Foundation processes because we will be amplifying contributions through those Apache projects. The innovation in those projects will accelerate due to the participation from the broader community, which frankly will drive more enterprise capabilities in the core platform that people can take advantage of."
Connolly described the process of bringing together end users, vendors and individuals in the community to collaborate on a shared set of goals as "challenging".
"But if you look at our track record with the Stinger initiative and the Data Governance Initiative, we have more than a few years of demonstrating we can bridge those worlds," he said.
"Some might look at Pivotal and IBM and others as competitors. We have to set those differences aside and focus on the things we can do jointly. That's what this initiative is about. It just comes from working together and building trust and we're used to that. It's really what open source is about."
Connolly said the initiative is also designed to reduce complexity and confusion in the Hadoop field, which may act as a barrier to adoption.
"If you look at the Hadoop industry, there are shared name components. There are varying versions of those components that have different capabilities, different protocols and API incompatibilities. What this effort is aimed at is a stable version of those, so that takes the guesswork out of the broader ecosystem," he said.
"In the community there's a lot of releases - release early and release often due to the nature of the innovation that happens with the open-source model, which makes it really confusing to figure out which version to standardise on."
Pivotal, spun out of EMC and VMware in 2013, said the Open Data Platform will work directly with specific Apache projects, adhering to the Apache Software Foundation guidelines on contributing ideas and code. The goal is to increase compatibility and make it easier for apps and tools to run on any compliant system.
The reference core of Hadoop components will include resource management layer YARN and the Ambari monitoring and provisioning tool.
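To illustrate the kind of check a certified reference core makes possible, here is a minimal sketch of version validation against a fixed component baseline. The component names follow the article, but the version numbers, the `REFERENCE_CORE` structure and the `is_compliant` helper are illustrative assumptions, not the actual Open Data Platform specification or test suite.

```python
# Hypothetical sketch of ODP-style certification: compare a cluster's
# installed component versions against a certified reference core.
# All version numbers below are illustrative assumptions.

REFERENCE_CORE = {
    "hadoop": "2.6",   # includes the YARN resource management layer
    "ambari": "1.7",   # monitoring and provisioning tool
}

def is_compliant(installed: dict) -> list:
    """Return a list of deviations from the reference core.

    A component deviates if it is missing, or if its installed version
    does not fall within the certified major.minor line.
    """
    problems = []
    for name, required in REFERENCE_CORE.items():
        version = installed.get(name)
        if version is None:
            problems.append(f"{name}: missing (need {required}.x)")
        elif not version.startswith(required + "."):
            problems.append(f"{name}: {version} (need {required}.x)")
    return problems

# Example: a cluster running an older Ambari line than the reference core.
print(is_compliant({"hadoop": "2.6.0", "ambari": "1.6.1"}))
# → ['ambari: 1.6.1 (need 1.7.x)']
```

The point of such a shared baseline is exactly what Connolly describes: every vendor building on the same certified versions removes the guesswork for tools and applications targeting the platform.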
Connolly said the Open Data Platform will be open to any other company that wishes to participate.
"We're not looking to exclude any players. We're actually looking to ensure we will include as many of those players as possible, depending on how they want to participate," he said.
"The more interesting thing for the broader market is how you make it easier for solutions built on Hadoop, as well as other big-data technologies, to be deployed more quickly. So the faster the market can grow, the better our business is."
Deepening the relationship struck up last July with their collaboration on Ambari, Pivotal and Hortonworks will now adopt a unified approach under a 'strategic and commercial alliance'.
As well as sharing a set of basic Hadoop components and some support activities, the two companies will be coordinating Hadoop engineering efforts, including those focusing on Pivotal services such as Hadoop SQL front end HAWQ, which Hortonworks will be offering as part of its platform.
Pivotal's Michael Cucchi said the aim is for his firm's advanced services to run on top of the Hortonworks Data Platform.
"HAWQ will be made available on Hortonworks and then we'll follow along with GemFire [NoSQL in-memory database] and Greenplum database integration into their Hadoop distribution. Those will also translate to other data platform distributions in the future, which is the whole point of this [Open Data Platform] initiative," he said.
"We're actually going to deliver Hortonworks advanced support to customers of Pivotal HD [Pivotal's Hadoop distribution] so the customer will get the world's best support for HAWQ from us and they'll get the world's best support for Hadoop from Hortonworks."
Hortonworks' Shaun Connolly said the relationship between the two companies will enable customers to use technologies such as HAWQ or GemFire on a YARN-based architecture on the Hortonworks data platform or with Pivotal HD.
"But if a customer buys a Pivotal Hadoop product, for instance, and they have issues around components that are Hortonworks primarily - we have the committers in the open-source community who are working on those - Pivotal is able to accelerate that support case from their support team into the experts at Hortonworks. So we'll be able to supply the level 2 and level 3 support seamlessly to Pivotal's customers," he said.
On top of the Open Data Platform announcement, Pivotal also revealed plans to open-source parts of its big-data technologies, including the core of the massively parallel processing Greenplum database, HAWQ, and the GemFire in-memory database.
Pivotal's Michael Cucchi said open-sourcing core components of its Big Data Suite will increase adoption of the technology by the community and enable software and infrastructure providers to take the code and extend it.
"We're going to release our major pillar offerings and they will be fully functional open-source code bases. However, we're going to hold back advanced features," he said.
"You can think of it as a dual licensing model, where the core functionality of the product is in the open-source community but some advanced features will be available through licensing with Pivotal."
Examples of advanced features that Pivotal will reserve for enterprise licensing are Greenplum's Pivotal Query Optimizer, Orca, and WAN options for GemFire.
"We're going to hold back things like WAN connectivity. So a customer could scale out GemFire in a single location but when they want to do truly enterprise-class, global, distributed databases, they'll come to Pivotal for WAN connectivity," Cucchi said.
"HAWQ is very similar to Greenplum. It's the world's most advanced SQL-on-Hadoop solution and the reason it's the most advanced is that it's based on the same query optimizer and executor that's in Greenplum. So HAWQ will look very similar to Greenplum. The query optimizer will be held back and a couple of other enterprise-specific features. We're in the middle now of determining the exact specifics of this stuff."
Pivotal's Big Data Suite, now available in the cloud, on Cloud Foundry and later this year as a physical hardened appliance, is also adding several new data services including ingestion framework Spring XD, the Redis key-value store, and the RabbitMQ message broker.