With cloud storage becoming the de facto data lake, it's almost quaint to recall that barely a few years back, SQL-on-Hadoop was a hotly contested battleground, with well over a dozen open source and proprietary engines duking it out. With the dust settling, as the Hadoop market winnows down with the combination of Cloudera and Hortonworks, all eyes are on how to access the growing masses of data sitting in cloud object storage. Today, each of the cloud data warehouse platforms provides a way to federate queries to cloud object storage.
But what if you don't want to stand up a Hadoop cluster or set up a data warehouse? A couple of years back, AWS opened the doors with Athena, which directly queries S3. Beneath the hood, Athena uses Presto, the interactive SQL-on-Hadoop query technology developed by Facebook that, for a while, had the distinction of being one of the only such frameworks without a major vendor behind it. It wasn't Impala, which had Cloudera behind it; BigSQL, a Db2 product from IBM; or HAWQ from Pivotal; nor was it the beefed-up version of Hive from Hortonworks. Translated? If you used Presto, you were on your own.
Teradata, which acquired Hadapt, began to fill that vacuum before spinning the Presto business off. The resulting company, now rechristened Starburst Data, wanted to be free to pursue midmarket firms outside of Teradata's core market.
Reflecting the fact that the big data world still includes Hadoop but has grown broader than it, you don't see a lot of benchmarks these days comparing SQL-on-Hadoop frameworks. Given that both were developed based on the Google Dremel project (now publicly available as Google Cloud's BigQuery data warehousing service), Presto is often compared to Impala. There are claims that Impala is still faster at individual queries. But with its roots as Facebook's internal big data query engine, used by thousands of users, Presto's strength has been in high concurrency, as tests against Apache Spark have revealed.
More to the point, Hadoop is still very much part of the big data picture, but so is querying cloud storage directly. The Apache Hadoop community is working to make cloud object storage a first-class citizen on par with HDFS, but as Mike Olson recently commented, the Hadoop community is still waiting for a definitive answer to AWS S3-compatible storage on premises.
In life after Teradata, Starburst Data is positioning itself as a federated query provider. Yes, Teradata still resells Starburst to its client base, but more often, Starburst Data is going up against better-financed rivals like Dremio. Instead of taking venture money, Starburst has been bootstrapping its growth and, remarkably for such an early-stage company, is already in the black. Compared to Dremio, which is branching out into data catalogs and Kubernetes support, Starburst is sticking to its knitting with security, usability, and performance. Its current release, announced today, adds a new "Mission Control" console that will facilitate connecting Starburst to different data sources.
While Starburst is positioning itself as cloud- and database-agnostic (it has far more connectors than Impala, for instance), its sweet spot will be providing a third-party alternative to AWS Athena. In doing so, it should probably take a cue from Dremio and add containerization and Kubernetes support to its roadmap. It also faces competition from AWS itself. While Starburst claims a performance edge over Athena today, its entire runtime (including the recently introduced query optimizer) is open source. Amazon could readily get its hands on that same technology, meaning that closing the performance gap could just be a matter of time. No matter: one of AWS's marquee customers, which has migrated many of its data platforms to the Amazon stack, remains one of Starburst's steadfast customers when it comes to querying S3.