With NYC Data Week and the Strata + Hadoop World NYC event, lots of Big Data news announcements have been made, many of which I've covered.
After a full day at the show and a series of vendor briefings this week, I wanted to report back on the additional Big Data news coming out with the events' conclusion.
Cloudera announced a new Hadoop component, Impala, that elevates SQL to peer level with MapReduce as a query tool for Hadoop. Although API-compatible with Hive, Impala is a native SQL engine that runs on the Hadoop cluster and can query data in the Hadoop Distributed File System (HDFS) and HBase. (Hive merely translates the SQL-like HiveQL language to Java code and then runs a standard batch-mode Hadoop MapReduce job.)
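To make the batch model concrete, here is a toy, single-process Python sketch of the map, shuffle and reduce phases that a query along the lines of SELECT region, COUNT(*) FROM sales GROUP BY region boils down to. The table, column and function names are invented for illustration. This is not the code Hive actually generates (that is Java, run as a distributed job), and it says nothing about how Impala executes the same query natively.

```python
from collections import defaultdict

# Hypothetical rows standing in for a "sales" table stored in HDFS.
SALES = [
    {"region": "east", "amount": 120},
    {"region": "west", "amount": 75},
    {"region": "east", "amount": 30},
    {"region": "north", "amount": 50},
    {"region": "west", "amount": 10},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed on the GROUP BY column."""
    for row in rows:
        yield row["region"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_region(rows):
    """Run the three phases in sequence, like one batch job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_region(SALES))
```

On a real cluster each phase is spread across many machines and the job is scheduled as a batch, which is where the latency that a native SQL engine tries to avoid comes from.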
Impala, currently in Beta, is part of Cloudera’s Distribution including Apache Hadoop (CDH) 4.1, but is not currently included with other Hadoop distributions. Impala is open source, and it’s Apache-licensed, but it is not an Apache Software Foundation project, as most Hadoop components are. Keep in mind, though, that Sqoop, the import-export framework that moves data between Hadoop and Data Warehouses/relational databases, also began as a Cloudera-managed open source project and is now an Apache project. The same may happen with Impala.
MapR optimizes HBase, sets new Terasort record
MapR, makers of a Hadoop distribution which replaces HDFS with an API-compatible layer over standard network file systems, and which is offered as a cloud service via Amazon Elastic MapReduce and soon on Google Compute Engine, introduced a new Hadoop distribution at Strata + Hadoop World. Dubbed M7, the new distribution includes a customized version of HBase, the Wide Column Store NoSQL database included with most Hadoop distributions.
For this special version of HBase in M7, MapR has integrated HBase directly into the MapR distribution. And since MapR’s file system is not write-once as is HDFS, MapR’s HBase can avoid buffered writes and compactions, making for faster operation and largely eliminating limits on the number of tables in the database. Additionally, various HBase components have been rewritten in C++, eliminating the Java Virtual Machine as a layer in the database operations, and further boosting performance.
And a postscript: MapR announced that its distribution (ostensibly M3 or M5), running on the Google Compute Engine cloud platform, has broken the record for the Big Data Terasort benchmark, coming in at under one minute -- a first. The cloud cluster employed 1,003 servers, 4,012 cores and 1,003 disks. The previous Terasort record, 62 seconds, was set by Yahoo running vanilla Apache Hadoop on 1,460 servers, 11,680 cores and 5,840 disks.
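For a sense of scale, here is a quick back-of-the-envelope comparison of the two clusters, using only the server, core and disk counts quoted above. The class and function names are my own. Elapsed times are left out because the new result was reported only as "under one minute."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hardware counts for one Terasort run, as quoted in the announcement."""
    name: str
    servers: int
    cores: int
    disks: int

    @property
    def cores_per_server(self) -> float:
        return self.cores / self.servers

    @property
    def disks_per_server(self) -> float:
        return self.disks / self.servers


MAPR_GCE = Cluster("MapR on Google Compute Engine", 1003, 4012, 1003)
YAHOO = Cluster("Yahoo on Apache Hadoop", 1460, 11680, 5840)


def fraction_of(new: int, old: int) -> float:
    """Return the new count as a fraction of the old count."""
    return new / old


if __name__ == "__main__":
    for cluster in (MAPR_GCE, YAHOO):
        print(f"{cluster.name}: {cluster.cores_per_server:.0f} cores/server, "
              f"{cluster.disks_per_server:.0f} disks/server")
    for field in ("servers", "cores", "disks"):
        share = fraction_of(getattr(MAPR_GCE, field), getattr(YAHOO, field))
        print(f"{field}: MapR run used {share:.0%} of Yahoo's count")
```

That works out to 4 cores and 1 disk per server for the MapR run versus 8 cores and 4 disks per server for Yahoo's, or roughly 69% of the servers, 34% of the cores and 17% of the disks.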
SAP Big Data Bundle
While SAP has interesting Big Data/analytics offerings, including the SAP HANA in-memory database, the Sybase IQ columnar database, the Business Objects business intelligence suite, and its Data Integrator Extract Transform and Load (ETL) product, it doesn’t have its own Hadoop distro. Neither do a lot of companies. Instead, they partner with Cloudera or Hortonworks and ship one of those companies' distributions.
SAP has joined this club, and then some. The German software giant announced its Big Data Bundle, which can include all of the aforementioned Big Data/analytics products of its own, optionally in combination with Cloudera’s or Hortonworks' Hadoop distributions. Moreover, the company is partnering with IBM, HP and Hitachi to make the Big Data Bundle available as a hardware-integrated appliance. Big stuff.
EMC/Greenplum open sources Chorus
The Greenplum division of EMC announced the open source release of its Chorus collaboration platform for Big Data. Chorus is a Yammer-like tool that lets the members of a Big Data project team communicate and collaborate in their various roles. Chorus is both Greenplum database- and Hadoop-aware.
On Chorus, data scientists might communicate their data modeling work, Hadoop specialists might mention the data they have amassed and analyzed, BI specialists might chime in about the refinement of that data they have performed in loading it into Greenplum, and business users might convey their success in using the Greenplum data and articulate new requirements, iteratively. The source code for this platform is now in an open source repository on GitHub.
Greenplum also announced a partnership with Kaggle, a firm that runs data science competitions, which will now use the Chorus platform.
Pentaho, a leading open source business intelligence provider, announced its close collaboration with Cloudera on the Impala project, and a partnership with Greenplum on Chorus. Because of these partnerships, Pentaho’s Interactive Report Writer integrates tightly with Impala and the company’s stack is compatible with Chorus.
Talend and Simba go NoSQL
Talend, provider of open source data- and application-integration software, announced its support for NoSQL databases HBase (yes, the very same database that MapR has optimized), Cassandra and MongoDB. The Talend support for these databases will be available next month as part of the upcoming version 5.2 release of its Open Studio for Big Data. Talend told me that support for additional NoSQL databases is bound to come. The company keeps an eye on community-contributed connector efforts, and takes it upon itself to fortify and harden the most popular ones, adding them to the core product.
Not to be outdone, Simba promoted its Big Data ODBC drivers, supporting Hive (yes, that same layer over MapReduce that Impala emulates and outperforms), Cassandra and MongoDB, as well as Google BigQuery.
ODBC (Open DataBase Connectivity) is a 20-year-old data access API standard from Microsoft, which is enjoying something of a renaissance lately. ODBC defines both a standard database driver framework (supported by most query, reporting and BI tools and many programming languages) and a SQL grammar which the drivers translate to the target database’s native language and commands. Simba’s Hive driver already ships as part of the Hortonworks and MapR Hadoop distributions, and the company announced that the Qubole cloud-based Hadoop platform will use it as well. But Simba’s Big Data drivers, procured directly, deliver ODBC compatibility to anyone, for all four databases.
Hortonworks racks up the partnerships
In addition to Qubole’s platform, Hadoop is available as a cloud service via Amazon’s Elastic MapReduce service, based on Amazon’s own Hadoop distribution or the MapR M3 and M5 distributions. As I mentioned previously, MapR’s Hadoop distro will also soon be available as a service via Google Compute Engine. Microsoft’s Windows-based “HDInsight” Hadoop distribution, developed in concert with Hortonworks, reached a new milestone release in its by-invitation Beta on Wednesday, and will soon be publicly available on the Windows Azure cloud platform.
What about Rackspace’s cloud? And what about the Linux-based Hortonworks Data Platform (HDP) Hadoop distribution? Well, the two companies announced their products will be united to offer one more Hadoop public cloud service. But since Rackspace’s cloud is based on the OpenStack platform, which can also be implemented on-premises to build private clouds, HDP as a private cloud service is now possible as well.
LucidWorks, the commercial entity most supportive of the Lucene and Solr projects, announced the beta release of its LucidWorks for Big Data product. The cloud-based platform creates a unified RESTful API (Representational State Transfer-based Application Programming Interface) around Hadoop and its companion components, like Pig, HBase and Mahout, oriented toward search-driven Big Data analytics.
Splunk, the Big Data company known for its wildly successful IPO, announced the availability of its Hadoop Connect product (which integrates Hadoop with Splunk Enterprise) and Splunk App for HadoopOps (a Hadoop monitoring, troubleshooting and health analysis tool).
Still not enough for you? How about a couple of new database releases? Metamarkets announced it has open sourced its Druid in-memory streaming real-time data store, and Calpont announced that version 3.5 of InfiniDB, its Massively Parallel Processing (MPP) database, will reach GA next month.
In my last several posts, I’ve summarized the huge array of Big Data announcements made in concert with this month’s Strata + Hadoop World NYC event. In future posts I’ll try to draw some conclusions about all the new products and initiatives that were released and announced.