Cloudera, the company behind the most widely deployed Hadoop distribution, did something surprising yesterday at Strata + Hadoop World, NYC. Instead of beckoning "old school" database and BI professionals (i.e. the majority of enterprise developers and DBAs) to move to Hadoop, it announced the beta of a new product, Impala, that brings Hadoop to them. Impala, part of Cloudera Distribution Including Apache Hadoop (CDH) 4.1, is a native SQL query engine that runs on Hadoop clusters, providing easy query access to raw HDFS data and to HBase databases.
Impala is a native SQL query engine that runs on Hadoop clusters, providing easy query access to raw HDFS data and to HBase databases.
The assumption that batch-oriented, MapReduce processing must be used to query Big Data has been shattered, and by the company that is arguably Hadoop’s staunchest advocate. Gone are the notions that enterprise skill sets are obsolete and that the command line is king. SQL, BI tools and Reporting are now Big Data technologies. Did Cloudera just blow your mind?
Also read: Cloudera aims to bring real-time queries to Hadoop, big data
Maybe you're skeptical. After all, Hive has provided a SQL Query abstraction – and BI tool compatibility – over Hadoop for some time, so why would Impala be significant? In fact why is Cloudera even bothering with it?
I interviewed Cloudera’s CEO, Mike Olson, who filled me in on Impala, in impressively technical detail. Here’s the deal: While Impala is, in fact, API-compatible with Hive and its ODBC driver, it’s nonetheless a completely different beast. Hive merely converts/compiles SQL queries into Java-based MapReduce code, which then runs in batch mode, just like other Hadoop tasks. Hive actually adds a step to MapReduce, while Impala replaces MapReduce.
It’s Pure SQL
Let me say that again: Impala is a native, distributed SQL Query engine that runs on Hadoop clusters and replaces Hadoop’s MapReduce engine. You still get Hadoop’s Distributed File System. You still get its physically distributed architecture. You still potentially get data locality, as the distribution of data across nodes doesn’t change – only the way it is queried does.
The assumption that batch-oriented, MapReduce processing must be used to query Big Data has been shattered.
Here come the BI Tools
Regardless of Impala’s degree of innovation, the ecosystem is already building. Yesterday, I interviewed Pentaho’s Co-Founder, Rich Daley, and its EVP of business development, Eddie White. They told me that Pentaho has been working closely with Cloudera to ensure Pentaho’s business intelligence tools work flawlessly with Impala.
Pentaho showed me their tools running on Impala, in a side-by-side comparison with Hive. In the demo, a specific SQL query was run on Impala, through a Pentaho reporting tool, and simultaneously on the Hive command line. The gentleman doing the demo for me got his data back from Impala, performed a number of reporting and data visualization tasks, and built me a completed report. Once he finished, the Hive version of the query -- running on the same data, in the same cluster -- was still running.
Next steps, and more
If you're interested in trying out Impala, Cloudera has a virtual machine image (in VMWare, Virtual Box and KVM formats), with the Impala beta pre-installed, available for download. And if you'd like to learn more without getting hands-on right away, Cloudera has also posted documentation.
I’ll have more posts coming about Strata + Hadoop World, and will also comment on some recurring themes I noticed at the show. Spoiler alert: one of those themes is the co-existence of SQL and MapReduce in the Big Data World, and Impala is a bold example of this phenomenon.