Announced at the Build conference last week, Azure Synapse Link extends the cloud data warehousing service to operational data, starting with Azure Cosmos DB. Microsoft also claims Link works without impacting Cosmos DB. The answer on how it does that might surprise you.
To recap, the new Azure Synapse Link feature extends Synapse footprint, starting with Azure Cosmos DB; Microsoft intends to extend the capability to the SQL side later on including Azure SQL Database and Azure Database for PostgreSQL. While Synapse's original mission was to unite the data warehouse and the data lake, the new Link feature extends that out to operational data.
As Azure's newest data service (it was only announced for preview last fall), it's now is taking the lion's share of the spotlight regarding new announcements. And actually, aside from core features, such as the core data warehouse with provisioned compute; workload isolation within a single cluster; and materialized views, most of the differentiating features of Azure Synapse Analytics are still in public preview. In that group are Azure Synapse workspaces, the new web development studio; SQL Serverless; Apache Spark for Synapse; and pipelines that leverage existing technology for Azure Data Factory. This is by no means a finished production product, yet.
Although technically a rebrand of the old Azure SQL Data Warehouse, Azure Synapse Analytics is actually a different platform, as it reengineers the predecessor offering. As we've stated previously, Azure Synapse Analytics is part of a broader trend for cloud data warehouses to spread their footprint both upstream, for ingesting and transforming data, and downstream, for delivering analytics. It's a pattern that we're seeing SAP and Oracle also following.
Big on Data bro Andrew Brust post provided a good blow-by-blow description of the new release, but after reading the account, we still had a core question nagging us: How does Azure Synapse Link pull operational data without impacting the performance of Cosmos DB?
We originally thought that Synapse Link connected to Cosmos DB through federated query, where the query is submitted on Synapse, and the is processing pushed down to data at the source, with the tables in the remote database represented as virtual (or external) tables. That's how a number of data warehouses work when they reach out to other data sources, such as Amazon Redshift Spectrum querying data in S3, or how Microsoft SQL Server PolyBase, Oracle Big Data SQL, IBM Big SQL connect to data in cloud storage or other data stores, and so on.
But Azure Synapse Link is a different creature – it extends the data warehouse to include data that is populated from Cosmos DB via the equivalent of a change-data-capture stream. To activate it, once you have Synapse operational, you go into Cosmos DB and check the option to enable analytics.
That is, as operations are made to data in Cosmos DB, these changes are automatically pushed into a columnar optimized format for analytics without impacting the performance of the source system. Synapse can then query over this data entirely independent of Cosmos DB, but effectively in real-time since changed operational data is kept in sync with minimal latency.
The automatic change data feed approach taken by Azure Synapse Link is not that unusual – for on-premises databases. For instance, MariaDB's X4 platform automatically replicates row-based transaction to the column store, while Oracle database in-memory and IBM Db2 BLU acceleration let customers replicate to in-memory column stores selectively. But surprisingly, it's not yet widely done in the cloud; AWS and GCP require building of data pipelines to feed transaction data into their data warehouses.
So, until AWS and Google respond, that's where Azure Synapse Link differentiates. The upside is that Azure Synapse Link makes the data warehouse an operational system that can generate analytics on historical and current data. The flipside, however, is that you wind up paying to store the same data at least twice – not counting the number of replicas that are routinely maintained by each service. That makes Link well-suited for operational analytics, but not as cost-effective for ad hoc use cases where you might only want to pay by the query (e.g., how Amazon Athena works when running SQL queries against S3). That's where we'd like to see Microsoft add an option down the road for ad hoc query for scenarios that don't always require constant updating, or where data changes are not all that frequent.