While everyone may be swooning over Apple's Worldwide Developers Conference in San Francisco, the Big Data world is gathering for its own confab, Hadoop Summit, run by Hadoop distribution vendor Hortonworks, and hosted at the bottom of the peninsula, in San Jose.
The event runs today through Thursday and it will definitely drive a news cycle. In fact, some news is already breaking this morning.
Although Hadoop Summit is run by Hortonworks, fellow Hadoop distribution vendor MapR had several announcements of its own in conjunction with the event, all bundled up with the release of MapR 5.0.
To begin with, MapR announced that the new release includes real-time replication features, including real-time integration between MapR-DB (the company's HBase-compatible operational database) and Elasticsearch. Such real-time integration ensures that search indexes in Elasticsearch are kept up to date as operations occur, rather than relying on a batch update process, which introduces a lag in search results that is unacceptable in many business scenarios.
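To make the difference between the two approaches concrete, here is a minimal Python sketch. It is a toy model only, not MapR-DB's or Elasticsearch's actual API: the `Table` and `SearchIndex` classes, the listener mechanism and the sample rows are all hypothetical names I made up for illustration.

```python
class SearchIndex:
    """Toy inverted index (hypothetical): term -> set of row keys."""

    def __init__(self):
        self.terms = {}
        self.docs = {}  # row key -> text currently indexed

    def upsert(self, key, text):
        # Drop any stale entry first, then index the new text.
        self.delete(key)
        self.docs[key] = text
        for term in text.lower().split():
            self.terms.setdefault(term, set()).add(key)

    def delete(self, key):
        old = self.docs.pop(key, None)
        if old is None:
            return
        for term in old.lower().split():
            self.terms[term].discard(key)

    def search(self, term):
        return sorted(self.terms.get(term.lower(), set()))


class Table:
    """Toy operational table (hypothetical) that notifies listeners on each write."""

    def __init__(self):
        self.rows = {}
        self.listeners = []

    def put(self, key, text):
        self.rows[key] = text
        for notify in self.listeners:
            notify(key, text)


table = Table()
realtime_index = SearchIndex()  # updated on every write
batch_index = SearchIndex()     # updated only when the batch job runs
table.listeners.append(realtime_index.upsert)

table.put("row1", "order shipped")

# Before the batch job runs, only the real-time index can find the row.
found_realtime = realtime_index.search("shipped")
found_batch_before = batch_index.search("shipped")

# The batch job: re-index everything currently in the table.
for key, text in table.rows.items():
    batch_index.upsert(key, text)
found_batch_after = batch_index.search("shipped")
```

The window between the write and the batch job is the latency in question; in the real-time path, that window shrinks to the time it takes to deliver each change.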
MapR also announced that Drill v1.1, which will be part of MapR 5.0, will include a new secured views feature. It allows for views that expose only certain columns of a table, and those views can be assigned to specific users and groups. By creating these views and removing access to the underlying tables, MapR customers can make certain that particular columns in a table won't be visible to certain users. This is especially important for tables that may contain PII (personally identifiable information). Views can also be used to publish custom data sets (containing specific rows of data, custom columns and/or custom joins) to certain users or groups.
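The general pattern of exposing a column-restricted view while withholding the base table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Drill's implementation: SQLite stands in for Drill, the `customers` table and its columns are hypothetical, and the `GRANTS` map is a made-up stand-in for whatever permission mechanism the platform actually uses.

```python
import sqlite3

# SQLite stands in for Drill here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, ssn TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Ada", "123-45-6789", "west"), (2, "Grace", "987-65-4321", "east")],
)

# The view exposes only the non-sensitive columns; ssn (the PII) is left out.
conn.execute("CREATE VIEW customers_public AS SELECT id, name, region FROM customers")

# Hypothetical access map: each user may read only the objects listed for them.
GRANTS = {
    "analyst": {"customers_public"},
    "admin": {"customers", "customers_public"},
}


def query(user, obj):
    """Return (column names, rows) for obj if user has been granted access to it."""
    if obj not in GRANTS.get(user, set()):
        raise PermissionError(f"{user} may not read {obj}")
    # obj has been checked against the grant list above, so it is a known name.
    cur = conn.execute(f"SELECT * FROM {obj}")
    cols = [d[0] for d in cur.description]
    return cols, cur.fetchall()


cols, rows = query("analyst", "customers_public")
```

Here the analyst can read the view, which carries no `ssn` column, while a call such as `query("analyst", "customers")` raises `PermissionError` because the base table was never granted to that user.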
Disclosure: I work for Datameer, a company that has partnered with MapR around its inclusion of these governance features.
In addition to Drill 1.1, MapR 5.0 includes Hadoop 2.7, Spark 1.3, Hive 1.1, Sentry 1.5 and Sqoop 1.99.5.
Auto-Provisioning Templates, or: why should cloud users have all the fun?
One more piece of MapR 5.0 is the inclusion of Auto-Provisioning Templates. Effectively, these templates provide a browser-based, step-by-step interface for provisioning a Hadoop cluster and specifying which license you'd prefer to use and precisely which components you'd like included.
This is quite similar to the wizard-driven user interface Amazon offers for Elastic MapReduce, which lets you select whether components like Hive, HBase and Hue should be installed onto the cluster. Such an interface is very helpful, and there's no reason why it should be limited to cloud-based clusters. It's nice to see a feature like this for on-premises Hadoop clusters.
Pentaho's new release now facilitates cloud-based Hadoop integration, with support for Amazon Elastic MapReduce. Integration with SAP HANA is included too. Support for Apache Spark has been added to Pentaho Data Integration (PDI), allowing PDI to orchestrate Spark jobs. The look and feel of PDI has been updated as well, says Pentaho. And new APIs have been added to make Pentaho an even stronger solution for embedding BI features into custom and commercial applications.
More news will likely be forthcoming from the Hadoop Summit event. I'll be on-site for the event's entirety, to sop it all up.