A closer look at IBM AutoSQL

IBM is filling out its Cloud Pak for Data hybrid analytics platform with new capabilities reaching out to the data lake and laying the groundwork for governance.


As part of the flurry of announcements for this year's IBM THINK conference, IBM has announced a major update to Cloud Pak for Data that provides more capabilities for consuming and managing data across different steps of the lifecycle. As Larry Dignan covered it yesterday, AutoSQL is at the heart of the new capabilities for distributed query of data, expanding Cloud Pak for Data's reach beyond the original prepackaged data warehouse to cloud object storage and other relational databases.

AutoSQL is one of several capabilities being introduced to Cloud Pak for Data, alongside AutoCatalog, which automates data discovery and maintains a real-time inventory of data assets, and AutoPrivacy, that uses machine learning to automate the designation of sensitive data and facilitate enforcement of privacy protection policies. Putting this all together is an architectural pattern that IBM calls the intelligent data fabric that provides APIs and integration for automating the management of data from discovery to query and governance/compliance. By the way, don't confuse it with a similarly named service that schedules and automates queries to MySQL.

IBM describes AutoSQL as a universal query engine. The simplest way to describe it is as distributed query engine for your data landscape, or in essence, IBM's take on the data lakehouse. That is, data may reside in Cloud Pak for Data's own data warehouse or data lake, external data warehouses, or in object storage in other clouds. Queries are pushed down to the source, rather than the typical method of moving data to the database engine (which would otherwise run the queries). It accesses databases, data lakes (e.g., cloud storage such as S3, Azure ADLS, Google Cloud storage, or other object stores), and streaming data.

The guiding notion with AutoSQL is that you place a query and don't have to worry about how to configure it to run against data that resides in different environments, on-premises or in the public cloud, and can run at petabyte scale. IBM has worked in various caching and acceleration schemes to accelerate query processing. Optimization and tuning of queries is all inside a black box.

AutoSQL's query engine harvests capabilities consolidated from IBM''s roughly half dozen SQL engines that originated for Db2, Netezza, BigSQL for Hadoop, and others. And it offers the option for running batch jobs using Spark. In essence, this is IBM's answer to Azure Synapse Analytics, which also provides a choice of SQL or Spark processing. IBM differentiates AutoSQL, and cloud Pak for Data, as being cloud-agnostic, vs. Azure Synapse, which only runs on Microsoft's cloud.

Being part of the data fabric architecture, AutoSQL works with AutoCatalog. The discovery part means aligning the data with business terms – that's where AutoCatalog comes in. The catalog can generate data description language (DDL) that AutoSQL can use for generating data stores for persisting results. This can be critical, especially when working with public clouds that have data egress charges, where it might be more expedient to persist the results of frequent queries in an intermediate data store. In a multicloud environment, where you want to minimize data movement from different clouds, such data stores can become the physical instantiation of materialized data views.

AutoSQL is the first step toward expanding the reach for Cloud Pak for Data. It has extended the envelope from the embedded data warehouse of Cloud Pak for Data to the broader world of data lakes. By combining capabilities from IBM's various SQL engines and working in Spark, AutoSQL, IBM puts query of data warehouses and data lakes (cloud storage) on equal footing. We would like to see closer integration with IBM Event Streams – with the ability to run SQL query on real-time data in motion. Given that IBM is positioning AutoSQL as a way to scale query out to multiple data sources, we have a couple questions. What about querying data at the edge – whether the data resides in a local database inside a factory, or the aggregation of results of analytics conducted on remote devices out in the field? Or, flipping the equation, in a cloud-native world where compute is separated from storage what about multiple compute engines that are accessing the same data? These are the answers that we'll expect to hear from IBM in coming months.