This week was full of Big Data news, including new releases, a brand new product, a new acquisition and an update to one of the major Hadoop distros. Let's get a survey of what was announced and then see if we can't draw a conclusion or two.
A big driver for much of the news was this week's Amazon Web Services re:Invent conference, providing an opportunity for Amazon to reveal new stuff, and for partners exhibiting at the show to do likewise. Some of the news took place outside the re:Invent orbit but let's start there anyway.
Perhaps Amazon's biggest data-related announcement was the General Availability of Amazon Athena, providing what we might call "SQL-on-S3-as-a-Service," which I guess would produce the acronym SS3aaS. While my nomenclature may be a little precious, it's also pretty self-explanatory. With Athena you can, on a rather ad hoc basis, query flat file data that you might have lying around in an S3 bucket, using standard SQL.
Athena turns out to be based on Presto, an open source SQL engine that can query many different data stores. The thing about Athena is, it's serverless...in fact, it's clusterless. So to run an Athena query, you don't spin up an Elastic MapReduce (EMR) cluster, or even an EC2 virtual machine. Instead, you head to the management console at https://console.aws.amazon.com/athena, set up a "table" by pointing to a file in S3 and specifying its format (CSV, TSV, custom-delimited, JSON, or one of the columnar formats, Parquet and ORC) and its schema, then query away.
I got Athena working in about two minutes, reading a sequence file from the (admittedly simple) output of the Wordcount Hadoop sample that I ran long ago on an old EMR cluster.
While it's annoying that I have to specify format and schema (for many files that's easily detectable, and Athena could have provided a default schema for me to accept or edit), it was still super-easy to use, with an otherwise frictionless startup.
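To make the "table over a file in S3" idea concrete, here is a minimal sketch in Python of what that setup amounts to outside the console. It is an illustration under stated assumptions, not Amazon's documented workflow: the bucket, table and column names are hypothetical, the data is assumed to be tab-delimited text (not the sequence file mentioned above), and the optional submission step assumes the boto3 library is installed and AWS credentials are configured. Note that the LOCATION in such a statement names an S3 folder (prefix) rather than a single object.

```python
def build_create_table_ddl(table, columns, s3_location, delimiter="\t"):
    """Return a Hive-style CREATE EXTERNAL TABLE statement for delimited text in S3.

    `columns` is a list of (name, sql_type) pairs. All names passed in are
    the caller's own; nothing here is specific to a real dataset.
    """
    # The DDL needs the two-character escape \t, not a literal tab character.
    escaped = "\\t" if delimiter == "\t" else delimiter
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED\n"
        f"  FIELDS TERMINATED BY '{escaped}'\n"
        f"LOCATION '{s3_location}'"
    )


def run_query(sql, database, output_location, region="us-east-1"):
    """Submit a statement to Athena and return its query execution id.

    Assumes boto3 is installed and credentials are configured. The import is
    deferred so the DDL helper above can be used without any AWS setup.
    """
    import boto3

    client = boto3.client("athena", region_name=region)
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Hypothetical word-count output: one word and its count per line.
    ddl = build_create_table_ddl(
        "wordcount",
        [("word", "string"), ("occurrences", "int")],
        "s3://example-bucket/wordcount/output/",
    )
    print(ddl)
```

Once a table like this exists, an ad hoc question is a single statement, for example `SELECT word, occurrences FROM wordcount ORDER BY occurrences DESC LIMIT 10`, which could be passed to the same hypothetical `run_query` helper.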
That ability to query data you already have, with almost no setup or forethought, is the gist of Amazon's positioning for Athena. The idea here is that while you can already find similar capabilities in the likes of EMR or Redshift, those services require at least some planning as well as setup and startup time.
Maybe that struck a nerve, somewhat, with Bob Muglia, the CEO of Snowflake Computing, which has a data-warehouse-as-a-service offering that also happens to run on Amazon's cloud. Muglia, while seeing the upside of Athena as validation for data processing in the cloud, was perhaps a bit careful to advocate for a full data warehouse, rather than just a casual querying tool, saying: "Even as the number of data processing options in the cloud proliferate, the need for a true data warehouse has grown exponentially." As a querying tool it's good, though, and Amazon announced that both its own QuickSight BI offering and Tableau are compatible with it.
Amazon had other announcements too, like the fact that Aurora, its MySQL-compatible managed relational database service, is now PostgreSQL-compatible as well. It also announced three new AI services: Lex, for building conversational interfaces using voice or text; Polly, for turning text into speech; and Rekognition, for facial, object and scene recognition.
Treasure Data, which had a booth at re:Invent, used the event to announce its new Treasure Workflow facility. The workflows in this product manage data pipelines, including garden-variety extracts as well as a host of API-based data transfers from applications. Not only can Treasure Data pull data from major SaaS applications, but others, with whom Treasure Data has partnered, can proactively push data into the product.
This technique also works with Web and mobile apps developed by Treasure Data's customers themselves, with the injection of simple code that "phones home" and shares relevant data. This gives Treasure Data an Application Performance Management (APM) spin.
Not all news stayed in Vegas
Beyond the world of Amazon, MapR announced the release of a new "Ecosystem Pack," adding support in MapR Streams for Kafka REST API and Kafka Connect compatibility; the addition of Spark 2.0.1 and Drill 1.9; and Installer Stanzas, which enable API-driven installation of MapR clusters on-premises or in the cloud.
A Birst of new features
Cloud BI provider Birst announced its new Birst 6 release. This release follows an important market trend: inclusion of data preparation functionality inside a core BI product. Referred to as "Connected Data Prep," Birst offers a self-service approach that divides the work into three steps, which the company has named "Connect," "Prep," and "Relate," and which includes machine learning-assisted transformation and joins.
And the machine learning doesn't end there; in fact Birst has added "Machine Learning Automation" to the product that includes prescriptive analytics and what Birst calls "One-click prediction" capabilities. Birst has also added various performance enhancements under the umbrella of what the company is calling "Cloud Scale Architecture."
And more consolidation
Last, and in no way least, the Big Data world heralded a new acquisition. Big Data ETL-oriented Syncsort (which itself was acquired by private equity company Clearlake Capital in October of last year) has announced its acquisition of data quality specialist Trillium Software.
Just as BI vendor Birst has integrated data prep into its product, it would seem here that we have a vendor specialized in industrial-strength ETL and data prep moving to integrate data quality capabilities into its own suite of products. Clearly, siloed functionality is on the wane, and integrated capabilities are on the rise.
All together now
In fact, if you take a look at Amazon's announcements, you'll see adherence to that same trend: by, effectively, including SQL querying capabilities in its S3 cloud storage, and adding Postgres compatibility to Aurora, Amazon's trying to keep you engaged by not making you go somewhere new for the capabilities you're seeking out.
Why spin up an EMR cluster, fire up Hive and write your own CREATE TABLE command, when you could just switch to the Athena management console and then point, click and query? Why go to some separate service to get a self-managed Postgres instance up and running (or do it yourself on an EC2 virtual machine) when Aurora (which also integrates with S3) has got you covered, and on a SaaS basis?
This is how data gets powerful: when the path to querying and analyzing it is short, and can be traversed on a whim. Users get more "insights" when they ask more questions. And when the disincentive to ask those questions melts away, more questions get asked. It really is that simple.