Big Data upended the economics and architectural practices of enterprise data warehousing by not only making it cost effective to store and process more data and more varied forms of it, but also promoting new patterns that pushed analytics computing and data tiers together.
Now the cloud is prompting a shift of the pendulum back the other way. By decoupling data from compute, cloud Big Data services can take advantage of object storage, which is far cheaper than HDFS file storage, and can make compute elastic. While Amazon EMR gives customers the option of using HDFS, most EMR customers have embraced S3. Recently, Amazon customer FINRA collaborated with AWS to port its HBase service to S3.
Yet paradoxically, few data warehouses have fully taken advantage of the cloud architecture. While Snowflake has led the way with an elastic data warehouse, Amazon's own Redshift has until recently relied exclusively on local storage. The reasoning is that cloud object storage is not suited to the type of performance that full-blooded databases deliver, because it is optimized for durability rather than accessibility.
Nonetheless, although cloud storage was not designed for performance or accessibility, it is economical and convenient. With several new offerings, Amazon now makes S3 available for query with or without Redshift. Athena is a serverless offering that lets you run SQL queries against S3 using the Presto distributed engine, while Redshift Spectrum treats S3 data as external tables for a federated query approach.
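To make the Athena pattern concrete, here is a minimal Python sketch of what submitting a SQL query over S3 data might look like. It only builds the request parameters that boto3's Athena client accepts for `start_query_execution`; the table name `web_logs`, the database `example_lake`, and the bucket `example-bucket` are hypothetical placeholders, not anything named by Amazon or Alation.

```python
# Sketch only: the table, database, and bucket names are hypothetical.

def build_athena_request(sql, database, output_location):
    """Build keyword arguments for boto3's Athena start_query_execution call."""
    return {
        "QueryString": sql,
        # The database (catalog schema) that holds the table definitions over S3.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results back to an S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_request(
    sql="SELECT event_type, COUNT(*) AS events FROM web_logs GROUP BY event_type",
    database="example_lake",
    output_location="s3://example-bucket/athena-results/",
)

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   athena = boto3.client("athena")
#   response = athena.start_query_execution(**request)
#   print(response["QueryExecutionId"])
```

The point of the sketch is that no cluster is named anywhere in the request: the query refers only to a catalog database and S3 locations, which is what decoupling storage from compute looks like from the user's side.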
Those offerings work fine if you're using AWS, but what if you're operating in hybrid mode, with your most sensitive PII on premises and data from less sensitive sources stored and run in the cloud? That's the opportunity for what we used to call data integration middleware to step into the breach.
Enter Alation, which offers a catalog for data lakes that is built using crowdsourcing, natural language processing, and machine learning techniques to help users discover data and optimize how they query it. For instance, Alation lets users search using plain business terms to find the right tables or topics, and then optimizes the building of SQL queries to get the data. Alation already searches Hive, and has integrations with Teradata to optimize federated query to Hadoop, and with Trifacta for coordinating cataloging and data wrangling (also known as data preparation).
This week, Alation is adding direct access to Hadoop's HDFS file system, Amazon's S3 cloud storage, and Kylo, the open source data lake management project developed by Teradata's ThinkBig consulting operation. That comes atop recent support for KSQL, the SQL interface that Confluent open sourced to make Kafka Streams accessible to SQL developers.
The common thread behind these additions is that they open access to data that previously required more highly skilled developers using programmatic approaches via Java or machine learning languages such as Python or R. For us, the S3 announcement is the sleeper. Although Alation, as a data catalog, overlaps with AWS Glue, it provides a bridge to hybrid environments for federated queries spanning S3 and on-premises clusters. While Alation lacks the ETL capabilities of Glue, it can provide a common view spanning the cloud and on-premises clusters, not to mention the higher-level SQL interface lacking from KSQL.
It is one of the pieces that will allow organizations to tap the convenience and economies of cloud storage (and compute) without having to run their entire data lake in the cloud.