IBM, Cloudera, Amazon announcements: Big Data news roundup

New products, new technologies and an acquisition announced over the last two weeks may be small items on their own, but they add up to an important shift in the Big Data space.

Over the past couple of weeks, several news items have cropped up in the Big Data world; enough of them, in fact, to warrant a roundup and analysis.

IBM DashDB enhanced
I've written previously about cloud-based elastic data warehouse (DW) services from Amazon Web Services (Redshift), Microsoft (Azure SQL Data Warehouse) and Snowflake. But IBM has a DW service as well, called DashDB, to which a couple of important enhancements were announced recently.

For example, the service can now accommodate data warehouses up to 20TB (and IBM tells me that's a conservative number), whereas, before, even the highest service tier maxed out at 12TB. In addition, the product now offers compatibility with PL/SQL, Oracle's procedural extension to the Structured Query Language (which is what SQL stands for -- if you didn't know). Together, these changes are clearly aimed at getting Oracle DW customers with wandering eyes to move to IBM's cloud.

DashDB is primarily based on IBM's veteran relational database technology, DB2. It also borrows in-database analytics technology from Netezza (IBM's Massively Parallel Processing DW platform). Now, combine that lineage and diversity with two other facts. First, DashDB sits in the same part of IBM that was created from its 2014 acquisition of NoSQL database firm Cloudant. Second, IBM announced last week (as covered by my ZDNet colleague Rachel King) that it has acquired Compose (formerly MongoHQ), a company that supports Database-as-a-Service offerings for NoSQL databases MongoDB, Redis and RethinkDB; relational database PostgreSQL; and search technology Elasticsearch. Put it all together, and IBM becomes a company worth watching in the world of cloud database technology and service offerings.

Cloudera announces project Ibis: a scale-out Python implementation for Hadoop
While Cloudera has made waves with its adoption and endorsement of Apache Spark, it has nonetheless stood by its own Impala SQL engine, effectively a competitor to Spark SQL. And the Impala group at Cloudera is definitely not in maintenance mode; in fact, it's spreading its wings, and announcing an open source project that is bringing scale-out capabilities to the Python programming language.

Python was not designed for distributed application development, i.e. for applications that run across nodes in a cluster. Meanwhile, it's a popular language among data science types who need to work with large data sets that certainly can overpower a single server. This has forced those practitioners to use data sampling in their work, in order to build models, in a reasonable amount of time, on a single compute node.

The R programming language faces the same issue, which technology from Revolution Analytics (now owned by Microsoft) addresses by letting R run across nodes in a compute cluster. Cloudera's Project Ibis does the same for Python. Cloudera takes a different architectural approach, though: it has essentially created a Python abstraction layer over Impala and lets that engine, in turn, take care of the distributed processing work.
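To make the idea of a Python layer over a SQL engine concrete, here is a minimal sketch of the general pattern: Python method calls build up a description of a query, and only the generated SQL is handed to the engine that does the distributed work. This is not Ibis's actual API; the Table class, its methods, and the flights table with its columns are all hypothetical names invented for illustration.

```python
class Table:
    """Hypothetical handle to a table that lives in a remote SQL engine.

    Nothing is computed in Python; each method returns a new Table that
    records what the eventual SQL statement should contain.
    """

    def __init__(self, name, predicates=(), group_col=None):
        self.name = name
        self.predicates = tuple(predicates)
        self.group_col = group_col

    def filter(self, predicate):
        # Record another WHERE condition (passed as a raw SQL fragment here).
        return Table(self.name, self.predicates + (predicate,), self.group_col)

    def group_by(self, column):
        # Record the grouping column for a later aggregate.
        return Table(self.name, self.predicates, column)

    def count_sql(self):
        # Render the recorded steps as one SQL string for the engine to run.
        select = "COUNT(*) AS n"
        if self.group_col:
            select = f"{self.group_col}, {select}"
        sql = f"SELECT {select} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        if self.group_col:
            sql += f" GROUP BY {self.group_col}"
        return sql


query = (
    Table("flights")
    .filter("year = 2015")
    .filter("delay > 30")
    .group_by("carrier")
)
print(query.count_sql())
```

The point of the design is that the data never has to fit on the machine running Python: the script only ships a SQL string, and an engine such as Impala would scan the full data set across the cluster and return the (small) result.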

The technology comes as a result of Cloudera's acquisition of DataPad last year, and the latter company's Python-based framework for analysis of big data sets. As with Impala, Ibis is an open source project, but one governed by Cloudera itself, rather than the Apache Software Foundation. It will be interesting to see if the technology and its coupling with Impala (which, Cloudera assures me, is a loose one) resonates with the Python community. Regardless, the very premise of a Python syntactic layer over a SQL engine is intriguing.

Amazon Introduces Elastic MapReduce 4.0.0
Elastic MapReduce (EMR), Amazon Web Services' cloud-based Hadoop offering, has itself been updated. The new version brings with it newer releases of its constituent Hadoop components (including Hadoop 2.6.0, Hive 1.0, Pig 0.14 and Spark 1.4.1). It also allows Hadoop to be configured when the cluster itself is configured (rather than through a "bootstrap action" that runs after the cluster has started), and adds a new "Quick cluster configuration" option, presented by default, for cluster provisioning itself. Amazon has also made its use of directory paths and network ports consistent with Hadoop standards, which, the company says, will allow for faster integration of new Hadoop ecosystem components, and/or updated component releases, in the future.
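As a rough illustration of what declaring Hadoop settings up front can look like, the sketch below builds a list of configuration objects and prints it as JSON. The Classification/Properties layout and the classification names (core-site, mapred-site) reflect how the EMR 4.x documentation describes such settings, but treat them as assumptions to verify against Amazon's documentation; the emr_configurations helper and the specific property values are hypothetical.

```python
import json


def emr_configurations(settings):
    """Convert {classification: {property: value}} into a list of
    configuration objects. The output layout is an assumption based on
    the EMR 4.x documentation, not something verified here."""
    return [
        {
            "Classification": name,
            # Property values are rendered as strings in the output.
            "Properties": {key: str(value) for key, value in props.items()},
        }
        for name, props in sorted(settings.items())
    ]


configs = emr_configurations({
    "mapred-site": {"mapreduce.map.memory.mb": 2048},
    "core-site": {"io.file.buffer.size": 65536},
})
print(json.dumps(configs, indent=2))
```

The contrast with a bootstrap action is that nothing here is a script to run on each node: the settings are plain data that can be supplied alongside the rest of the cluster definition.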

Apache NiFi becomes top-level project
Apache NiFi, an integration workflow engine and Web browser-based user interface, which had been an Apache Software Foundation Incubator project, is now a top-level project. The project and its name emanate from technology created at the US National Security Agency (yup, the NSA) called Niagara Files.

I spoke last week with Joe Witt, who created NiFi in 2006, when he worked for the NSA. He is now both the "VP" of the Apache NiFi project and the CTO of Onyara, a company focused on extensions to NiFi. Witt assured me that although NiFi has a genesis in his work at the NSA, there is nothing "spooky or scary" about it.

Witt also explained that unlike many "lines and boxes"-type workflow systems, where workflows are designed first, then deployed and executed, NiFi's visual user interface facilitates interactive editing of executing flows. Witt likens this aspect of NiFi to molding clay, whereas others offer an experience more like 3D printing, where you design first, then render your object.

Speaking of other such systems, Witt made clear that NiFi is to be thought of more like an enterprise integration or service bus tool, and less like a data integration tool.

What it all means
Is any one of these stories big news on its own? No. But taken together, there's a lot here.

Suddenly, IBM will be offering at least seven novel cloud-based data engines, and has its sights set on Oracle with one of them. Cloudera is breaking past Hadoop, Spark and SQL to go after Python developers and data science folks. Amazon is making EMR (and thus Hadoop) much more plug-and-play than it had been. And the Apache Software Foundation now has a project that seeks to rival open source products like MuleSoft and commercial software from the likes of Tibco and even IBM, which takes us rather full circle.

What's especially interesting about all of these announcements is that they are signs of a maturing market. No one's introducing a brand new data processing engine. Instead, IBM is improving, acquiring and competing; Amazon is streamlining; Cloudera is re-platforming; and Apache NiFi is integrating.

It's that shift to evolutionary improvement that is the real story here.