It's been a terrible winter in my home town of New York City — and I'd rather be in California, where O'Reilly's Strata conference kicks off today.
As it happens though, I'm somewhere better still: Dorado, Puerto Rico on a family vacation. That notwithstanding, there's a ton of Big Data news today, in conjunction with Strata, so I felt compelled to roll up the announcements in a single post.
MapR, which, along with Cloudera and Hortonworks, is a major Hadoop distro/vendor, has three announcements today.
First, MapR has revved its Hadoop distribution to incorporate Apache Hadoop 2.2, along with YARN, the new component which allows Hadoop to be used with data processing algorithms beyond MapReduce.
MapR will also allow the Hadoop 1.x and YARN schedulers to run side-by-side on the same nodes in the cluster, allowing customers to migrate more easily between Hadoop 1.x and 2.x, and permitting Hadoop 1.x-based third party services to run on Hadoop 2.x-based MapR clusters.
Second, in what seems like quite a coup, MapR is announcing support for HP's Vertica MPP data warehouse platform to be run directly on a MapR cluster. Because MapR's implementation of HDFS (the Hadoop Distributed File System) is based on a conventional read-write file system (and not the write-once, read-many architecture that standard HDFS implementations offer), Vertica can run on MapR and store and update its data, in native format, directly in MapR's storage layer.
This provides a SQL-on-Hadoop solution not unlike that provided by Hadapt — which is not totally surprising given that Hadapt clusters often run MapR as their on-board Hadoop distribution. Of course, Hadapt and Cloudera have a partnership as well, which may, perhaps, become more important with MapR's announcement.
Finally, MapR has announced the release of its MapR Sandbox for Hadoop virtual machine image. Not unlike the Hortonworks Sandbox, MapR's namesake is a single-node full install of the MapR Hadoop distribution in the form of a ready-to-run virtual machine (VM) image that also includes multiple tutorials and other learning resources.
The Cloudera QuickStart VM is available for the same purpose (and was the first to market), which means that all three major distros are now available in this format, for free, making it far easier for developers and other professionals new to Hadoop to bypass what can be a complex installation regimen, and get hands-on with Hadoop right away.
A new version of Couchbase, the key-value/document store hybrid NoSQL database, built atop CouchDB, is being released today.
Update: While Couchbase derives some technology from CouchDB, it also incorporates tech from Memcached, and a lot of original code as well. Saying Couchbase is "built atop CouchDB" was therefore inaccurate.
Couchbase 2.5 includes new features aimed at Enterprise-readiness. These include Rack Awareness, with which the company says "the user can create logical groupings of Couchbase Server nodes and replica copies of the data are automatically distributed across server nodes located on different racks," and an enhancement to Cross-Data Center Encryption, which now allows data transferred over Wide Area Network (WAN) connections between data centers to be SSL-encrypted.
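To make the rack awareness idea concrete, here is a minimal sketch in Python of how replica copies might be spread across logical groups of nodes. This is not Couchbase's actual placement algorithm or API; the function name `place_replicas`, the node names, and the group names are all hypothetical, chosen only to illustrate the idea of preferring nodes on different racks.

```python
from collections import defaultdict
from typing import Dict, List


def place_replicas(active_node: str,
                   node_to_group: Dict[str, str],
                   num_replicas: int) -> List[str]:
    """Pick replica nodes, preferring groups (racks) other than the active node's.

    Hypothetical illustration only; not how any particular product does it.
    """
    active_group = node_to_group[active_node]

    # Bucket candidate nodes by group, excluding the active node itself.
    by_group = defaultdict(list)
    for node in sorted(node_to_group):
        if node != active_node:
            by_group[node_to_group[node]].append(node)

    # Visit other groups first; fall back to the active node's own group last.
    group_order = sorted(g for g in by_group if g != active_group)
    if active_group in by_group:
        group_order.append(active_group)

    chosen: List[str] = []
    # Round-robin across groups so replicas land on as many racks as possible.
    while len(chosen) < num_replicas and any(by_group[g] for g in group_order):
        for g in group_order:
            if by_group[g] and len(chosen) < num_replicas:
                chosen.append(by_group[g].pop(0))
    return chosen


# Assumed example cluster: four nodes in three logical groups.
nodes = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackC"}
replicas = place_replicas("n1", nodes, 2)
```

With the assumed cluster above, both replicas of data whose active copy lives on `n1` end up outside `rackA`, so losing that one rack would not take out every copy.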
In its marketing and press releases, Couchbase seems to be gunning for MongoDB in the Enterprise NoSQL space (however big it may be). Expect the rigor of both products to increase as that competition gains momentum.
Alpine Data Labs is releasing the third version of its analytics workbench which, this time around, fully integrates the functionality of Chorus, an open source platform for social collaboration on data science projects.
The Alpine product, even without the Chorus functionality, is pretty impressive, as it provides an abstraction layer over various distributions of Hadoop (including Cloudera, MapR, Pivotal HD and vanilla Apache Hadoop); PostgreSQL; Oracle 11g and Exadata; Pivotal's HAWQ SQL-on-Hadoop offering; and its Greenplum MPP data warehouse appliance. Alpine allows for visual composition of predictive analytics dataflows and then generates and pushes down the appropriate MapReduce code or SQL queries, depending on the target data store.
While Chorus was formerly an EMC/Pivotal product, the OpenChorus initiative made it an open source project, the leadership of which has been turned over to Alpine.
As a result, the prior Alpine product and Chorus have now been fused together, under the Chorus brand, in what Alpine's Chief Product Officer Steven Hillion assures me is a seamless whole. I don't know too many visual tools for predictive analytics beyond RapidMiner and KNIME.
As neither of those products has a collaboration platform (which, among other things, makes data projects searchable by such attributes as the project name and the names of the project stakeholders, rather than just the data itself) built in, Alpine seems worthy of a deeper look.
Alpine also has an open architecture, accessible via a REST API.
Scottish biomedical analytics specialists Aridhia have used the API to integrate Alpine with the R statistical programming language. Community ingenuity such as this is very good, as the most useful add-ons may find their way into the core codebase of future releases.
Open source BI vendor Pentaho, which has been gaining traction with its Pentaho Data Integration (PDI) product, is announcing today that PDI is now integrated with Hadoop 2.0's YARN component and the Apache Storm real-time streaming data engine.
This YARN integration is a big deal for both YARN and Pentaho. It means that data integration on Hadoop can run in an efficient mode that bypasses the batch-mode MapReduce algorithm. The Storm integration means Pentaho also brings real-time streaming data integration to the table.
Pentaho, which was one of Cloudera's first partners for the latter's Impala SQL-on-Hadoop solution, is wasting no time integrating with YARN, announcing support for it on the same day that MapR is announcing YARN's very inclusion in its Hadoop distro.
The keyword that ties these four sets of announcements together is "Enterprise."
Couchbase is adding rack awareness and encryption between data centers. MapR is integrating with Enterprise data warehousing platform Vertica. Pentaho is integrating data in Hadoop using a non-batch mode algorithm, something important to BI specialists used to working with modern BI platforms and their query times. Alpine Data Labs is making predictive analytics more visual and collaborative for Enterprise analyst teams.
None of this is first-generation raw processing capability. Instead, it's second- and third-generation fit and finish, with attention to Enterprise priorities, including fast, interactive access to data in Hadoop, visual tooling, social collaboration and security.
That we're not even a month and a half into 2014 is a great sign, and it raises the chances that Hadoop will become bulletproof, mainstream, and embedded, or at least significantly more so than it has been.