Today's the second day of Strata + Hadoop World in New York City and there's already a glut of announcements. But when one of the biggest brands in the Hadoop market delivers a keynote address at the conference, and uses it as the vehicle for its own announcements, you can bet there will be some important stuff in there. Last year, the Palo Alto-based company announced its Impala SQL-on-Hadoop engine...and this year almost every other vendor has an offering in that space, in response. So what's happening now?
This year, Cloudera's announcements are partly about product, and partly about business. Given that Mike Olson, the company's founding CEO, has shifted to Board chair and Chief Strategy Officer, and that Tom Reilly, former CEO of ArcSight, has now taken the chief executive reins at Cloudera, this makes sense. The company can now focus on business expansion and product strategy, in parallel.
Red meat for the feature-hungry
On the product side, Cloudera is announcing today the Beta release of version 5.0 of its flagship Cloudera Distribution including Apache Hadoop (CDH), whose GA is expected very early next year. CDH 5 is built atop the Apache Hadoop 2.0 GA release that I covered the week before last, and it adds some special Cloudera sauce, to boot. Among the new features:
- The ability to pin data from the Hadoop Distributed File System (HDFS) in memory
- A new release of Impala, v1.2, which now supports user-defined functions (UDFs) written in Java and, in Mike Olson's words, "other scripting languages"
- Role-based security for data retrieved through Apache Hive, via the inclusion of Apache Sentry, an Apache Incubator project launched by Cloudera
- Data lineage and auditing, via the inclusion of Cloudera Navigator
- New APIs and a pluggable architecture for Cloudera Manager, allowing the tool to deploy, configure and manage third-party products on the Hadoop cluster. SAS is already on board to be Cloudera Manager-accessible; other products will likely follow
Again, CDH 5 is based on Hadoop 2.0, which includes the YARN component that allows Hadoop to work independently of the MapReduce data processing algorithm. YARN delivers various performance gains for any distro that uses it, including CDH 5.
Business is a feature
On the business side, Cloudera has announced several things. One of them, announced today, is a branding initiative, but that's hardly a trivial matter. Here's the gist: Cloudera is now referring to its Hadoop offering as an "Enterprise Data Hub." In doing so, the company is staking a claim that Hadoop isn't just a data scientist sandbox anymore.
In fact, and in the spirit of NYC DataWeek, we might characterize Cloudera's take as Hadoop being the Ellis Island of data, an intake center if you will, where said data can be cleansed, shaped, aggregated, queried, indexed and searched, before heading elsewhere. And in some companies, Hadoop isn't just Ellis Island, it's Manhattan -- where data comes to reside, and get monetized.
Swing your partner
Speaking of monetized, Cloudera is announcing two new partner initiatives. The first is Cloudera Connect: Cloud, which will let CDH customers decide just where to run their Hadoop cluster. That location could be with cloud and hosting providers like SoftLayer, Verizon Business, Savvis, or T-Systems, who can offer what Cloudera is (not surprisingly) calling its "Data Hub as a Service." The program will also facilitate smooth operation in the private cloud, through cooperation with VMware and OpenStack.
The second initiative, Cloudera Connect: Innovators, acknowledges the innovation on the Hadoop stack that occurs outside of Cloudera's walls and looks to incorporate the best of those external projects into CDH, in the name of adding more spokes to the (Enterprise Data) Hub. In this program, Cloudera will identify projects that are novel and interesting and bring them into the fold.
The inaugural on-board Cloudera Connect: Innovators partner is Databricks, the commercial entity behind the open source Apache Spark and Shark projects. The former is an in-memory processing engine that rides atop the Hadoop cluster; the latter is a Hive-compatible SQL engine interface to the former.
As a final business initiative, Cloudera is announcing that it will no longer provide support agreements for individual components of the Hadoop stack (e.g. MapReduce and Hive, but not Pig or HBase). Instead, Cloudera Enterprise and its support will cover the entire CDH platform, so that the Enterprise Data Hub can live up to its name, with all points of entry and egress available.
Eyes on the (enter)prize
Pinning data in memory, adding security, data lineage, third-party integration into its management interface, bringing Spark and Shark onto the CDH train, working closely with cloud providers, and bringing in a seasoned CEO. Looks to me like this is big, "aim-high" stuff.
When I spoke to CEO Tom Reilly last week, he confirmed that for me, citing as Cloudera's two biggest competitors not Hortonworks and MapR, but EMC/Pivotal and IBM Information Management. Reilly gave a nod to Oracle too, once I followed up by asking him if Larry Ellison's shop was in Cloudera's sights as well (lest we forget that Mike Olson was a VP at the Redwood City, CA firm).
In any case, none of this should be surprising, really. This is where the Hadoop world is going: the Enterprise, or bust.