With NYC Data Week and the Strata + Hadoop World NYC event, lots of Big Data news announcements have been made, many of which I've covered.
After a full day at the show and a series of vendor briefings this week, I wanted to report back on the additional Big Data news that came out as the events concluded.
Cloudera announced a new Hadoop component, Impala, that elevates SQL to peer level with MapReduce as a query tool for Hadoop. Although API-compatible with Hive, Impala is a native SQL engine that runs on the Hadoop cluster and can query data in the Hadoop Distributed File System (HDFS) and HBase. (Hive merely translates the SQL-like HiveQL language to Java code and then runs a standard batch-mode Hadoop MapReduce job.)
Impala, currently in Beta, is part of Cloudera’s Distribution including Apache Hadoop (CDH) 4.1, but is not currently included with other Hadoop distributions. Impala is open source, and it’s Apache-licensed, but it is not an Apache Software Foundation project, as most Hadoop components are. Keep in mind, though, that Sqoop, the import-export framework that moves data between Hadoop and Data Warehouses/relational databases, also began as a Cloudera-managed open source project and is now an Apache project. The same may happen with Impala.
MapR optimizes HBase, sets new Terasort record
MapR, maker of a Hadoop distribution that replaces HDFS with an API-compatible layer over standard network file systems, introduced a new Hadoop distribution at Strata + Hadoop World. (MapR's distribution is offered as a cloud service via Amazon Elastic MapReduce and, soon, on Google Compute Engine.) Dubbed M7, the new distribution includes a customized version of HBase, the Wide Column Store NoSQL database included with most Hadoop distributions.
For this special version of HBase in M7, MapR has integrated HBase directly into the MapR distribution. And since MapR’s file system is not write-once as is HDFS, MapR’s HBase can avoid buffered writes and compactions, making for faster operation and largely eliminating limits on the number of tables in the database. Additionally, various HBase components have been rewritten in C++, eliminating the Java Virtual Machine as a layer in the database operations, and further boosting performance.
And a postscript: MapR announced that its distribution (presumably M3 or M5), running on the Google Compute Engine cloud platform, has broken the time record for the Big Data Terasort benchmark, coming in at under one minute, a first. The cloud cluster employed 1,003 servers, 4,012 cores and 1,003 disks. The previous Terasort record, 62 seconds, was set by Yahoo running vanilla Apache Hadoop on 1,460 servers, 11,680 cores and 5,840 disks.
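To put those cluster sizes in perspective, here is a rough Python sketch that compares the two configurations using only the server, core and disk counts quoted above. The `Cluster` class and its field names are my own illustration, not anything from MapR or Yahoo, and since MapR's exact run time was given only as "under one minute," the sketch compares hardware, not speed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hardware counts for one Terasort run (hypothetical helper type)."""
    name: str
    servers: int
    cores: int
    disks: int

    @property
    def cores_per_server(self) -> float:
        return self.cores / self.servers

    @property
    def disks_per_server(self) -> float:
        return self.disks / self.servers


# Figures as reported in the announcement.
mapr_gce = Cluster("MapR on Google Compute Engine", 1003, 4012, 1003)
yahoo = Cluster("Yahoo on Apache Hadoop", 1460, 11680, 5840)


def share(new: int, old: int) -> float:
    """Return the new count as a fraction of the old count."""
    return new / old


for cluster in (mapr_gce, yahoo):
    print(
        f"{cluster.name}: {cluster.cores_per_server:.0f} cores/server, "
        f"{cluster.disks_per_server:.0f} disks/server"
    )

print(f"Servers: {share(mapr_gce.servers, yahoo.servers):.0%} of Yahoo's")
print(f"Cores:   {share(mapr_gce.cores, yahoo.cores):.0%} of Yahoo's")
print(f"Disks:   {share(mapr_gce.disks, yahoo.disks):.0%} of Yahoo's")
```

By these counts, the MapR cluster averaged four cores and one disk per server, versus eight cores and four disks per server for Yahoo, and used roughly 69% of the servers, 34% of the cores and 17% of the disks.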
SAP Big Data Bundle
While SAP has interesting Big Data/analytics offerings, including the SAP HANA in-memory database, the Sybase IQ columnar database, the Business Objects business intelligence suite, and its Data Integrator Extract, Transform and Load (ETL) product, it doesn’t have its own Hadoop distro. Neither do a lot of companies. Instead, they partner with Cloudera or Hortonworks and ship one of those companies' distributions.
SAP has joined this club, and then some. The German software giant announced its Big Data Bundle, which can include all of the aforementioned Big Data/analytics products of its own, optionally in combination with Cloudera’s or Hortonworks' Hadoop distributions. Moreover, the company is partnering with IBM, HP and Hitachi to make the Big Data Bundle available as a hardware-integrated appliance. Big stuff.
EMC/Greenplum open sources Chorus
The Greenplum division of EMC announced the open source release of its Chorus collaboration platform for Big Data. Chorus is a Yammer-like tool that lets the members of a Big Data project team communicate and collaborate in their various roles. Chorus is both Greenplum database- and Hadoop-aware.
On Chorus, data scientists might communicate their data modeling work, Hadoop specialists might mention the data they have amassed and analyzed, BI specialists might chime in about the refinement of that data they have performed in loading it into Greenplum, and business users might convey their success in using the Greenplum data and articulate new requirements, iteratively. The source code for this platform is now in an open source repository on GitHub.
Greenplum also announced a partnership with Kaggle, a firm that runs data science competitions, which will now use the Chorus platform.
Pentaho, a leading open source business intelligence provider, announced its close collaboration with Cloudera on the Impala project, and a partnership with Greenplum on Chorus. Because of these partnerships, Pentaho’s Interactive Report Writer integrates tightly with Impala and the company’s stack is compatible with Chorus.