Almost seven years ago, in a hotel meeting room in Manhattan, Mike Olson, then Cloudera's CEO, briefed me on the still-confidential Cloudera project called Impala. I think Olson knew he was preaching to the converted as he told me how inefficient and insufficient MapReduce-based computing was for the enterprise. The answer, he said, was Impala, a Hive-compatible database that used Hadoop for storage but completely bypassed MapReduce for compute and processing.
A data warehouse in impala's clothing.
As I dug deeper, I found out there was more to the story. Impala wasn't just a MapReduce-less Hive. In fact, Cloudera said, it was actually a massively parallel processing (MPP) data warehouse that just happened to use HiveQL as its language and HDFS (the Hadoop Distributed File System) for storage.
Eventually, Impala went open source, first under Cloudera's own auspices and then under the Apache Software Foundation. So as Impala became the generic name, Cloudera sought a brand name for the implementation of Impala in CDH, its own Hadoop/Spark distribution. That name became Cloudera Analytic Database.
But, remember, Impala is a true MPP data warehouse. So why beat around the bush? With that in mind, I suppose, Cloudera is today announcing the debut of Cloudera Data Warehouse (DW), the Impala-based product formerly known as Cloudera Analytic Database.
In a conference call briefing, Cloudera's Joydeep Das, Senior Director, Data Warehousing Products, and Susan Space, Senior Director of Corporate Marketing, explained to me that Cloudera DW is more than a branding exercise, for a couple of reasons.
First of all, Impala is no longer tied exclusively to HDFS -- in fact, the product can use Amazon S3 or Microsoft's Azure Data Lake Store (ADLS) for storage. It can also use Kudu, Cloudera's own columnar storage layer (the nomenclature there is intentional -- impala and kudu are both species of antelope).
And when you add in other Cloudera and Hadoop ecosystem components, like Sqoop, Flume, Hue and Hive itself, you see why Cloudera feels it has an end-to-end solution for modern data warehousing on offer.
Head (node) in the clouds
The S3 and ADLS compatibility also means that Cloudera DW can run in the cloud -- and, in fact, it's been able to do so for some time, as long as you didn't mind doing so on an Infrastructure as a Service (IaaS) basis using cloud virtual machines. But Cloudera has had a Platform as a Service (PaaS) cloud offering for Hive and Spark, called Altus. So why not add the DW?
In fact, Cloudera is doing just that, introducing a PaaS version of Cloudera DW, called... wait for it... Altus Data Warehouse. As with Cloudera DW on IaaS, Altus DW will use the cloud storage layer, allowing compute and storage to be scaled separately. But the new PaaS offering will also relieve the customer of having to provision and manage the infrastructure.
Still a little trepidation?
In my briefing with Cloudera, I learned that the company is not targeting the Cloudera/Altus DW products at enterprise data warehouse (EDW) scenarios. Instead, Das told me, the products are targeted at data mart-style implementations that are either departmental or scenario-based in nature.
Specifically, Cloudera is targeting three core use case categories:
- Optimizing existing data marts
- Working with non-transactional data, like log files and IoT sensor data
- Analyzing textual data in tandem with relational data, for example, doctors' notes and electronic medical records
Cloudera feels that implementations in the above three categories are where the growth in the market is. I might agree, and I think that targeting them is not unwise. But I am still struck by how, even with the product re-branded as a data warehouse, Cloudera is de-emphasizing use of the product as an EDW.
Regardless of rhetoric, though, the scenarios above are well on the radar of cloud data warehouse companies like Snowflake, Amazon (with its Redshift product), Microsoft (with its Azure SQL Data Warehouse) and Google (with BigQuery). So whether we're talking marts or warehouses, Cloudera, the seminal Hadoop distribution vendor, is now a relational data warehouse contender.