On and off, over the years, I have followed and written about the SQL-on-Hadoop saga. The adventure started with Apache Hive, which originally provided a SQL layer on top of MapReduce, bringing new usability to Hadoop, but little utility for interactive query scenarios. Things got interesting in the fall of 2012, when Cloudera introduced the beta release of Impala, its SQL-on-Hadoop engine that bypassed MapReduce completely, providing for true interactive query over Hive-compatible data on Hadoop.
A lot happened subsequent to that, but it can be pretty easily summarized as follows: (1) virtually every relational database and data warehouse vendor introduced an interactive SQL-on-Hadoop technology to query Hadoop data with its own query engine and (2) distributed memory and disk-based data framework Apache Spark became a thing, and the introduction of its Spark SQL module provided a way to query Hive-compatible data using its own processing substrate.
This left what appeared to be a trichotomy of SQL-on-Hadoop approaches: Hadoop-oriented Hive, standalone Impala and memory-oriented Spark SQL. But with the new releases of Apache Hive and Apache Drill, both of which came out on Tuesday, a new category is emerging, and other categories that looked mutually exclusive are starting to overlap.
So let's take stock of Tuesday's announcements and then piece together an appropriate new taxonomy of SQL-on-Hadoop technology.
Let's start with Apache Drill. I've written about this interesting open source project before. Largely driven by team members at MapR, Apache Drill is different from other Big Data SQL engines. Instead of working with schema-based Hive-formatted tables, along with some support for HBase data, Drill features a plug-in capable engine that at present can query schema-less files, JSON, Hive, HBase and even MongoDB data. And it can reach files stored locally, in HDFS, and cloud storage systems from Amazon, Microsoft and Google.
Whereas the other engines we're discussing here create a relational database environment on top of Hadoop, Drill instead enables a SQL language interface to data in numerous formats, without requiring a formal schema to be declared. This enables plug-and-play discovery over a huge universe of data without prerequisites and preparation. So while Drill uses SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might be SQL-on-Everything, with very low setup requirements.
Add in the ODBC and JDBC drivers for Drill and that exploratory power gets extended to a wide array of business intelligence tools. And now that the official version 1.0 release has come, Drill may become strategic to a number of BI vendors. In fact, MapR has already worked with Information Builders, MicroStrategy, Qlik, SAP, Simba, Tableau, TIBCO and others to ensure compatibility between those vendors' tools and Drill, via its ODBC/JDBC drivers.
Meanwhile, back at the Hive, some very interesting things have been happening. First, the folks at Hortonworks have led an effort they call the "Stinger" initiative (hive, bee, stinger, get it?) to modernize Hive and transform it from a MapReduce-based, batch mode engine to one capable of interactive query as well. Working closely with other contributors, including significant participation from engineers on the Microsoft SQL Server team, Hive has been imbued with a number of enterprise relational database engine features.
But the interesting thing about Hive is that it's common to the entire Hadoop community. So while Hortonworks has felt strongly that improving Hive is a superior strategy to introducing new Hive-compatible engines, the company doesn't have exclusive control over the project. Case in point: the 1.1 release ushered in a major contribution from Cloudera that enables Hive to execute queries over Apache Spark.
Will the best SQL engine please stand up?
Not surprisingly, Cloudera is very excited about this project, whereas Hortonworks offers voluminous material explaining how Hive-on-Tez is much faster than Hive-on-Spark. Of course, Hive-on-Spark isn't really done yet, and Cloudera rates this release a beta. What's significant is that Hive 1.1 was the first release to include the Hive-on-Spark code and 1.2 buttresses that code; before Hive 1.1, it was available as a separate download installable as a "parcel" overlay for Cloudera's distribution of Hadoop.
Where does that leave Spark SQL? Honestly, it's hard to say. For its part, Hortonworks actually points out that Spark SQL is faster than Hive-on-Spark. But if customers need to choose between a SQL engine that runs only over Spark, and one that can do that and run over Hadoop MapReduce and Apache Tez as well, Spark SQL may have some explaining to do.
And then there are the Stinger project goodies in Hive, including a cost-based query optimizer, the LLAP ("live long and process" - I kid you not) layer that provides query caching services, and the vector processing logic, which processes multiple rows of data simultaneously rather than one-at-a-time.
A new taxonomy
So while the SQL-on-Hadoop world used to be divided into batch, interactive, and proprietary database vendor implementations, it's now about much more.
With Hive, you have enterprise RDBMS features and multi-engine execution. With Drill, you have SQL and BI tool-based access to data in a diversity of locations and formats, and the full range of degrees of structure.
Spark SQL has the advantage of integration with core Spark, Spark Streaming and Spark MLLib. Meanwhile, database vendors let you see Hadoop data as if it were conventional tables in your relational database or data warehouse.
Lots of options, to be sure. Now it's time to build a strategy, so you can move forward with your Big Data analytics work.
Correction: the initial version of this post indicated that Hive 1.2 was the first release to include the Hive-on-Spark code. In fact, Hive 1.1 was the first to include it.