The first generation of Azure SQL Data Warehouse (SQL DW) was announced in 2015, and SQL DW "Gen 2" reached general availability in 2018. Today, at its Ignite confab in Orlando, Microsoft is announcing Synapse Analytics, essentially the third generation of SQL DW, along with new capabilities in preview. In general, Synapse Analytics seeks to unify an array of analytics workloads, including data warehouse, data lake, machine learning and the data pipelines that act as the mortar between those bricks.
Break it down for me
In a briefing with ZDNet, Daniel Yu, Microsoft's Director Products - Azure Data and Artificial Intelligence, and Charles Feddersen, Principal Group Program Manager - Azure SQL Data Warehouse, went through the details of Microsoft's bold new unified analytics offering. Based on that briefing, my understanding of the transition from SQL DW to Synapse boils down to three pillars:
- The core data warehouse engine has been revved, with new features to compete with other cloud data warehouse platforms, including the ability to accommodate workloads through explicitly provisioned or on-demand (serverless) infrastructure, each with its associated pricing model
- The integration of Apache Spark (the open source flavor, and not Azure Databricks) and Azure Data Lake Storage (ADLS) to accommodate data lake workloads
- A unified Web user interface, called Azure Synapse studio, that provides control over both the data warehouse and data lake sides of Synapse, along with Azure Data Factory, to accommodate data prep and data management
Spark integration, and more
The integration of Apache Spark seems to be more than just a "bundling" of the open source big data analytics framework. For example, when a Synapse cluster is provisioned, ADLS capacity -- which can store Spark SQL tables -- is requisitioned along with it (as is Azure Data Factory). Spark SQL tables are immediately queryable from the SQL Server-based T-SQL language, without first requiring explicit commands like CREATE EXTERNAL TABLE. The engine these queries leverage apparently integrates natively with data files stored in Apache Parquet format.
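To make the contrast concrete, here is a minimal sketch of what skipping CREATE EXTERNAL TABLE would save. It simply builds the two T-SQL statements as strings: the explicit external-table definition in the style SQL DW has used for files in a data lake, and the plain query that Synapse would apparently allow against a Spark SQL table directly. Every object name here (dbo.trips, the columns, sales_lake, parquet_format) is hypothetical, the helper functions are my own, and how Synapse actually names or exposes Spark tables to T-SQL was not covered in the briefing.

```python
# Illustrative sketch only. All object names (dbo.trips, sales_lake,
# parquet_format) are hypothetical, and both helpers are invented for
# this example; they only build T-SQL text and connect to nothing.

def external_table_ddl(table, columns, location, data_source, file_format):
    """Build the explicit DDL that exposes lake files to T-SQL as a table."""
    cols = ",\n    ".join(f"[{name}] {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"    {cols}\n"
        f")\n"
        f"WITH (\n"
        f"    LOCATION = '{location}',\n"
        f"    DATA_SOURCE = {data_source},\n"
        f"    FILE_FORMAT = {file_format}\n"
        f");"
    )


def direct_query(table, limit=10):
    """Build the query that can run once the table is already visible to T-SQL."""
    return f"SELECT TOP {limit} * FROM {table};"


# Explicit route: define the external table first, then query it.
ddl = external_table_ddl(
    table="dbo.trips",
    columns=[("trip_id", "INT"), ("fare", "DECIMAL(10, 2)")],
    location="/trips/",
    data_source="sales_lake",       # assumed to exist already
    file_format="parquet_format",   # assumed to be a Parquet file format object
)

# Route described in the briefing: the Spark SQL table is queried with no DDL step.
query = direct_query("dbo.trips")
```

In the explicit route, the data source and file format objects also have to be created beforehand, so the setup the new approach removes is somewhat more than the single statement shown.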
Such a feature will serve as a close competitor to Amazon Web Services' Athena service, which provides SQL query over data in S3. Beyond that capability, however, Azure Synapse studio integrates a notebook experience, ostensibly accommodating the development and execution of Python, Scala and native Spark SQL code blocks. Spark integration also means that Synapse can handle machine learning workloads, by virtue of Spark MLlib.
Beyond Spark ML, Microsoft is also discussing integration with Azure Machine Learning, Power BI, Azure Data Share and applications/services that support the Open Data Initiative (based on Microsoft's Common Data Model), though with fewer specifics. Those integrations will likely gel over time, and while the Synapse brand launches today, the new features that accompany it are being rolled out only in preview form.
A fork in the SQL Server-Spark road?
Interestingly, the on-premises SQL Server product, from whose engine and Transact-SQL language Synapse Analytics can trace its heritage, is also launching a new version today (SQL Server 2019 -- which I cover in a separate post). With a feature called Big Data Clusters (BDC), it also integrates Apache Spark and data lake workloads. And despite SQL Server's on-premises identity, BDC is completely based on Kubernetes container orchestration, which is implemented particularly well by Azure Kubernetes Service (AKS).
Effectively, this means Microsoft is, on the same day and at the same event, launching two new options for combining SQL Server technology with Apache Spark, and both can run on Azure. Meanwhile, the two are implemented differently. And while Synapse has its Azure Synapse studio, SQL Server 2019 offers a notebook-capable, cross-platform (Windows/macOS/Linux) desktop user interface for database and data lake workloads, called Azure Data Studio.
This bifurcated path for Spark integration and tooling is bound to cause customer confusion, unfortunately. And the offering of yet another Apache Spark implementation on Azure, separate from Azure Databricks, may pose difficulties of its own, especially since Microsoft lists Databricks as one of its partners for Synapse.
There are important differences between all these services, though. SQL Server is geared primarily towards OLTP (online transaction processing) requirements; Databricks shines in the realms of data engineering and machine learning; Synapse is the service you'll want if MPP (massively parallel processing) data warehouse analytics are front-and-center for your needs. The fact that Spark and data lakes cut across all three of these just shows how important that technology and analytics model, respectively, have become.
Brust is a Microsoft Data Platform MVP and has done work for the Microsoft Advanced Analytics team.