Cloudera wants Spark and Hadoop to be one platform that works

Apache Spark has gained huge industry support while continuing to show deficits in enterprise readiness. With its One Platform Initiative, Cloudera has concretely declared its intention to remedy those shortcomings.

Cloudera proclaimed some time ago that it saw Apache Spark as the future of Big Data. It predicted, and committed to help bring about, a world where most Hadoop ecosystem components would run on the memory-centric Spark processing engine and would rid themselves of their dependency on MapReduce.

Since that time, the Spark project has enjoyed huge industry adoption. Products like ClearStory Data and Paxata use Spark as their native engines. IBM announced its own $300 million commitment to Spark -- including dedication of 3500 researchers and the establishment of a Spark Technology Center in San Francisco -- at Spark Summit this past June. And just last week, SAP announced its own Spark-based HANA Vora technology.

That's all well and good, but criticism that Spark is not ready for enterprise production has persisted. I myself have heard concerns raised about scale, fault tolerance and prevention of data loss, as well as gripes about a general lack of stability. Most vendors have soldiered on, subscribing to a generally optimistic belief that the kinks will be worked out.

Just do it
Someone needs to take action, though. On Wednesday, Cloudera announced the One Platform Initiative, which very specifically sets out to address Spark's shortcomings with an eye toward making it not just robust and reliable, but the primary execution engine in the Hadoop ecosystem. If Cloudera has its way, every new Hadoop project will use Spark and dispense with MapReduce.

I spoke to Eli Collins, Cloudera's Chief Technologist, who got pretty specific about what Cloudera is setting out to do. He laid out the general "pillars" of the One Platform Initiative: improvements to Spark's management interfaces, its security, its scalability and its streaming data capabilities.

The One Platform Initiative seeks to integrate Spark much more deeply with Hadoop. Cloudera wants Spark to run on Hadoop's YARN resource management layer more adeptly and to take far greater advantage of Hadoop's Distributed File System (HDFS).
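To make "running on YARN" concrete, this is roughly what submitting a Spark application to a YARN-managed Hadoop cluster looks like today; the jar name, HDFS path and resource sizes here are illustrative placeholders, not anything Cloudera has published:

```shell
# Submit a Spark application to YARN, reading input from HDFS.
# The application jar, input path, and resource sizes are illustrative.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 8 \
  --executor-memory 4g \
  --executor-cores 2 \
  my-analytics-job.jar hdfs:///data/events/
```

It is exactly this layer, how well Spark requests, uses and releases YARN resources, that the initiative aims to make more adept.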

The to-do list
Cloudera says it has already brought capabilities to Spark such as data locality (where compute tasks run on nodes that already store the data), integration with HDFS caching and, for better perimeter security, integration with Kerberos.
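The data-locality idea is simple to illustrate: given which cluster nodes hold replicas of an HDFS block, a locality-aware scheduler prefers to run the task on one of those nodes rather than ship the data over the network. The sketch below is a deliberately simplified toy, not Spark's actual scheduler:

```python
def pick_node(block_replicas, free_nodes):
    """Toy locality-aware placement: prefer a free node that already
    holds a replica of the block (node-local); otherwise fall back to
    any free node, which forces a remote read over the network."""
    for node in free_nodes:
        if node in block_replicas:
            return node, "NODE_LOCAL"
    return free_nodes[0], "ANY"

# An HDFS block is replicated on nodes A and C; nodes B and C have free slots.
node, level = pick_node({"A", "C"}, ["B", "C"])
print(node, level)  # C NODE_LOCAL -- the block never crosses the network
```

Spark's real scheduler adds wait-based fallbacks between locality levels, but the preference order is the same.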

Now Cloudera wants to do things like improve Spark's Web user interface for a better debugging experience and add auto-tuning of job parameters, based on data volume changes and available cluster resources. It also wants to integrate Spark with Cloudera Manager and Cloudera Navigator.
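Auto-tuning of job parameters could look something like the heuristic below, which derives a partition count from input volume and available cores. The formula, the 128 MB target and the function name are all hypothetical, offered only to make the idea concrete; they are not Cloudera's actual tuning logic:

```python
import math

def suggest_partitions(input_bytes, total_cores,
                       target_partition_bytes=128 * 1024 * 1024):
    """Hypothetical heuristic: aim for one partition per ~128 MB of
    input, but never fewer than two tasks per available core, so the
    cluster stays busy even on small inputs."""
    by_size = math.ceil(input_bytes / target_partition_bytes)
    by_cores = 2 * total_cores
    return max(by_size, by_cores)

# 10 GB of input on a 16-core cluster:
print(suggest_partitions(10 * 1024**3, 16))  # 80 -- bounded by input size
```

The point of auto-tuning is that numbers like these would be recomputed as data volumes and free cluster resources change, rather than hand-set per job.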

Vested interest
Cloudera knows that a snappy, memory-centric system like Spark is necessary to capture the attention of a market accustomed to OLAP and data warehouse systems. Such technologies, though focused on smaller data volumes than Hadoop, are nonetheless analytics-oriented and more responsive than batch-based systems like MapReduce.

But Cloudera also knows its customers need technology that has the kind of fit, finish, scalability and reliability that mere early-adopter technologies don't provide. These customers also need to integrate with the Hadoop storage media and ecosystem components they have already invested in. The One Platform Initiative is very prudently focused on advancing Spark's functionality and Hadoop integration, to satisfy those very enterprise customers.

The One Platform Initiative's goals are unassailable and the public commitment to them is helpful all by itself. It's really one of the more sensible initiatives to come out of a big data vendor. The industry, having backed Spark almost religiously, needs the One Platform Initiative to be successful. Chances are good that it will be.