Unless a "neutral" third party publishes them, we tend to view benchmarks as self-serving exercises that vendors typically stack in their own favor. But recent benchmarks issued by Cloudera and Hortonworks for their SQL-on-Hadoop engines point to something serious going on. In an era of Spark hype, SQL remains table stakes for Hadoop platforms.
Yes, you can perform machine learning, model customer ecosystems as social graphs, run streaming, and conduct sentiment analysis, but for most organizations, the first question is often how fast the interactive SQL is. Using Hadoop only for SQL queries might seem like a waste, given its appeal to R or Python developers. But getting buy-in requires satisfying the BI crowd, because in many organizations, SQL's the gateway drug to Hadoop.
And looking at the benchmarking press releases, you get a sense of who's afraid of whom. For Cloudera, it's Amazon. Competitive benchmarks pitted Impala 2.6, Cloudera's SQL-on-Hadoop MPP engine, against Amazon's Redshift columnar analytic database. The results, announced a couple of weeks back at Strata, showed Impala performing 4 to 10x faster on either S3 (which Redshift doesn't use) or EBS (which it does).
Cloudera is stating that even a database that is decoupled from storage (Impala) can now perform better than one that follows a traditional, tightly coupled data warehouse architecture (Redshift). It's a shot across the bow, given that if you want consistent SLAs, high concurrency, or support for very complex SQL syntax, conventional wisdom has been to use a database rather than Hadoop. Cloudera's results don't change that reality, but they do show performance in the same ballpark as Redshift. And they get those results using AWS's default S3 storage.
But Cloudera's underlying message is not just that Impala has been tuned to go faster. It knows that, while only a minority of customers are deploying to the cloud today, in the long term, the writing is on the wall. And it doesn't want customers bound for AWS to simply default to Amazon's own engine, EMR.
Cloudera is aiming to show that it offers more value (e.g., security and management features not matched by EMR), and that its performance is competitive with Redshift. It also wants to show it's more economical (it can use S3 storage), and deployment shouldn't be more complex.
So the Impala benchmark is just the latest proof point. Another is deployment.
Traditionally, if you deployed Cloudera on AWS, you had to configure and run it yourself, whereas Amazon offers EMR and Redshift as managed services that eliminate the need for managing physical deployment. Cloudera has countered with Cloudera Director to make deployment to Amazon as simple as if you were working with EMR. Point taken.
For its part, Hortonworks just released results this week aiming to (not surprisingly) one-up Cloudera. Using a range of queries, Hortonworks' release of Hive 2.0 with the LLAP (Live Long and Process) engine outperforms Impala on complex queries by anywhere from 10 to 100x (for the record, Impala performed slightly better on queries that complete in 30 seconds or less).
The takeaway is that Hortonworks finally has an answer to Impala. This has been an extended saga for Hortonworks, given that since Cloudera released Impala four years ago, well over a dozen open source and proprietary alternatives have hit the market. And so Hortonworks has been playing catch-up; it has used its later start to apply lessons learned, with modern caching and a cost-based query optimizer. And it's found a way to work under YARN, despite the fact that YARN was never designed for continuous workloads.
The value of benchmarks is ephemeral at best, as they capture only momentary snapshots that, even when the tests are conducted fairly, grow irrelevant with the next release. Either way, Cloudera's and Hortonworks' benchmarks are a reflection of how critical SQL is for getting enterprises to buy into Hadoop.