Impala, Kudu, and the Apache Incubator's four-month Big Data binge

The last half of 2015 is shaping up to be huge for Big Data projects in the Apache Incubator

Just since August of this year, the Apache Incubator has welcomed, or been asked to consider, an array of new projects from the Big Data world. Beyond having Big Data in common, all of them began life either as commercial software products or as vendor-managed open source projects. And all of them cluster around SQL-on-Hadoop, streaming data processing, machine learning, or some combination of those.

This leaves us with a great number of Apache Incubator projects that overlap with each other and with certain Apache Software Foundation (ASF) top-level projects. Let's take some inventory here, so we can try to keep everything straight.

Will Cloudera's animals join the Apache menagerie?
Let's start with the biggie: on November 17th, Cloudera announced its submission of a proposal for Impala to join the Apache Incubator. Though Impala has been an Apache-licensed open source effort from the start, it's been overseen by Cloudera, and not the ASF. And typically, projects enjoy much wider adoption when their governance is ASF-managed, versus vendor-managed.

Still, Impala managed some broader adoption even as a Cloudera-managed project. For example, MapR includes Impala with its own Hadoop distribution, and Amazon Web Services (AWS) added Impala to the Hadoop distro in its Elastic MapReduce service. Version 4.x of the AWS distro no longer includes Impala, but customers can opt to deploy version 3.x and include Impala on those clusters. And perhaps putting Impala under the auspices of the ASF will bring it back to the latest and greatest versions of the Amazon distro.

A companion product for which Cloudera has also submitted an Apache Incubator proposal is Kudu: a new storage system that works with MapReduce 2 and Spark, in addition to Impala. Introduced in beta form by Cloudera in October, Kudu supports low-latency reads on data the way raw Hadoop Distributed File System (HDFS) access does, but also allows fast writes and updates, which in pure-HDFS-land could really only be handled by HBase. Interestingly, both Impala and Kudu feature column store technology.
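To make the trade-off concrete, here is a toy sketch, in Python, of what Kudu is aiming at: a columnar layout for fast analytic scans (as with Parquet files on HDFS) combined with row-level inserts and in-place updates (which, on HDFS alone, would push you toward HBase). This is purely an illustration of the idea, not Kudu's actual design or API.

```python
# Toy column store: values live column-wise for fast scans, while a
# primary-key index allows row-level updates in place. Illustration
# only -- not how Kudu is actually implemented.

class ToyColumnStore:
    """Rows keyed by a primary key; values stored column by column."""

    def __init__(self, columns):
        self.columns = columns                  # ordered column names
        self.data = {c: [] for c in columns}    # column -> list of values
        self.index = {}                         # primary key -> row position

    def upsert(self, key, row):
        """Insert a new row, or update an existing row in place."""
        if key in self.index:
            pos = self.index[key]
            for c in self.columns:
                self.data[c][pos] = row[c]      # HBase-style update
        else:
            self.index[key] = len(self.data[self.columns[0]])
            for c in self.columns:
                self.data[c].append(row[c])     # append like HDFS writes

    def scan_column(self, column):
        """An analytic scan touches only one column's contiguous values."""
        return list(self.data[column])


store = ToyColumnStore(["city", "temp"])
store.upsert(1, {"city": "Johannesburg", "temp": 25})
store.upsert(2, {"city": "Cape Town", "temp": 19})
store.upsert(1, {"city": "Johannesburg", "temp": 28})  # update, not append

print(store.scan_column("temp"))  # [28, 19]
```

A pure-HDFS system handles the appends and the scans well, but not the third `upsert`; an HBase-style system handles the upsert, but scans are row-oriented. Kudu's pitch is doing both in one storage layer.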

Starting to look alike
There are so many SQL engines available for Hadoop that moving Impala to the ASF, while important, is perhaps a less desperately needed move. But getting a Big Data storage system suitable for both batch and streaming work to the ASF is a big deal, and will help Kudu shed the perception that it is vendor-specific software. That's especially appropriate for something as low-level as storage, which has to have broad support or it won't get much traction at all.

If you didn't know, Impala and Kudu are both types of African antelope. Twenty years ago, I went to a game park in South Africa and saw a bunch of each. Their horns are pretty splendid. Hopefully, if the Impala and Kudu proposals are approved, Cloudera will lock horns less with its competitors over these two technologies. Then the ecosystem's only problem will be choosing between Hive, Impala, Spark SQL and Drill, all of which will be Apache projects.

And more
That ain't all. Remember the splash Pivotal made, when it introduced HAWQ, essentially an HDFS-based implementation of its Greenplum data warehouse platform? Well, the company open sourced it, as an Apache project, with acceptance into the Apache Incubator coming this past September. Pivotal also open sourced its MADlib machine learning technology, which works in an integrated fashion with HAWQ, and was also accepted into the Incubator in September. And if you think of the two projects as a matched pair, they sound similar to the combination of Apache Spark SQL and MLlib, don't they?

Now, before you really process that one, you should note another recent Incubator project: Apache Apex, "an enterprise grade native YARN big data-in-motion platform that unifies stream processing as well as batch processing." It's essentially the open sourced core of DataTorrent RTS. And that would seem at least to overlap with the combination of Spark Streaming and the core Spark engine.

So, under the Apache umbrella, we have at least four SQL-on-Hadoop engines, and at least four streaming-plus-batch systems if you count Kudu and Apache Flink.

Competition is good, right?
The redundancy and competition are both drivers of innovation and fragmenting forces in the Big Data market. We have to ride out the wave, because once things shake out, the pace of improvements in functionality will be relatively slow. But only when that happens will enterprises truly sink their teeth into these technologies and apply them ubiquitously.

For now, the best we can do is track the projects and keep score. And if you can figure out an approach that abstracts away the differences between competing execution engines, then you may have a more reliable road map with which to go forward.
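One hypothetical shape such an abstraction could take is a thin adapter layer: each engine (Hive, Impala, Spark SQL, Drill) sits behind a single interface, so application code isn't welded to any one project. The engine classes below are stand-ins I've invented for illustration, not real client libraries.

```python
# Sketch of engine-agnostic query dispatch via the adapter pattern.
# The "engines" here are fakes for illustration; a real version would
# wrap each engine's actual client driver behind the same interface.

from abc import ABC, abstractmethod


class SqlEngine(ABC):
    """Common interface every engine adapter implements."""

    @abstractmethod
    def execute(self, sql):
        ...


class FakeImpalaEngine(SqlEngine):
    def execute(self, sql):
        return f"impala result for: {sql}"


class FakeHiveEngine(SqlEngine):
    def execute(self, sql):
        return f"hive result for: {sql}"


def run_report(engine, table):
    # Application code sees only the SqlEngine interface, so swapping
    # engines is a one-line change at the call site.
    return engine.execute(f"SELECT count(*) FROM {table}")


print(run_report(FakeImpalaEngine(), "events"))  # impala result for: SELECT count(*) FROM events
print(run_report(FakeHiveEngine(), "events"))    # hive result for: SELECT count(*) FROM events
```

The hard part, of course, is not the interface but papering over each engine's SQL dialect and performance quirks, which is exactly where the competing projects differ.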