A closer look at Microsoft Azure Synapse Analytics

How is Azure bridging the data warehouse and data lake experience, and where should Microsoft go with it?


Roughly six months after unveiling Azure Synapse Analytics at the Ignite conference last fall, Microsoft took a group of analysts and MVPs on a deep dive into the service. As noted in Andrew's coverage last fall, Azure Synapse Analytics is a rebrand and evolution of Azure SQL Data Warehouse, broadening its footprint to span data warehousing, data lakes, and data integration within a single cloud service.

The guiding notion is getting as close as possible to a single source of truth, which in this case amounts to converging the data warehouse, data lake, and data integration. That challenge is a lot harder than it sounds, as it not only brings together highly curated relational data with a broader array of variable and semi-structured data, but also means bringing together different groups of practitioners with skillsets, methods, and compute demands that are often diametrically opposed.

At one end, you have database developers skilled in working with SQL, whereas at the other end of the spectrum, data scientists and developers working off the data lake have typically worked with programmatic analytics in languages such as Python. Data warehouses, like any relational system, have typically been used for production and operational scenarios demanding reliable performance and, frequently, the ability to serve a large population of users, while data lakes are more associated with experimentation on highly varied data sets and with less predictable workloads serving a handful of end users.

So, the result is you have different workload characteristics, different data types, and different access patterns. That's the same rationale that spawned data warehouses years ago, as query and reporting workloads interfered with operational systems. But with Azure Synapse, Microsoft is seeking on the analytics side to bring the poles together. Although Azure Synapse is a generally available service today, the expanded platform is barely six months out of the gate. So, while Azure Synapse has the capabilities to support business analysts and data scientists, there are still more pieces to fall into place.

Let's start with keeping the lights on. Workload management has been a well-known issue for data warehousing for years. Demand patterns for ad hoc inquiry, end-of-period reporting, and complex analytics are well understood, and turnkey data warehouse system providers like Teradata have long offered a family of models optimized, respectively, for data-intensive, compute-intensive, high- or low-concurrency, and "balanced" workloads to maximize the output of compute resources.

When Hadoop came along, it was assumed that the brunt of workloads would be data-intensive, and so compute was moved to the data. Enter cloud-native, and the pendulum swung back to separating compute from data for economic reasons (analytic workloads often tend to be spiky, so why pay for compute you're not always using?), with the high bandwidth of modern cloud backplanes addressing the data movement problem. Then came AI, which, depending on whether it's machine learning or deep learning, has diverging resource requirements.
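To make the economic argument concrete, here is a back-of-the-envelope sketch in Python. Every figure in it (the hourly rate, the number of busy hours, the 730-hour month) is a hypothetical chosen for illustration, not actual Azure pricing, and the function name is our own.

```python
def monthly_compute_cost(hourly_rate, busy_hours, hours_in_month=730, always_on=True):
    """Monthly bill for a cluster charged by the hour.

    always_on=True models compute tied to storage (billed for every hour);
    always_on=False models compute that is paused when idle.
    """
    billed_hours = hours_in_month if always_on else busy_hours
    return hourly_rate * billed_hours


# Hypothetical inputs: a cluster at 10.0 currency units per hour
# that is actually busy for 120 hours in the month.
rate = 10.0
busy = 120

fixed = monthly_compute_cost(rate, busy, always_on=True)
elastic = monthly_compute_cost(rate, busy, always_on=False)
savings = 1 - elastic / fixed

print(f"always-on: {fixed:.0f}, pay-per-use: {elastic:.0f}, savings: {savings:.0%}")
```

The spikier the workload (the smaller the share of busy hours), the larger the gap, which is the case for decoupling compute from storage in a nutshell.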

So, bringing the data warehouse together with the data lake is no mean feat. Azure Synapse has attacked the workload issue with a cloud-native architecture that builds on the separation of compute from storage in SQL Data Warehouse Gen 2 and extends this concept to heterogeneous SQL and Spark compute within a single service. For now, it uses Azure Data Lake Storage (ADLS) Gen 2, which is designed to deliver the economies of cloud object storage with the performance advantages of exposing the data through a file system API that is POSIX-compliant. Azure Synapse Analytics also offers a multi-level hierarchical cache within the SQL engine that automatically moves data between performance tiers (which include disk storage and NVMe SSD caching) depending on the user workload, while Spark analytics runs on high-memory (8-GByte/node) instances.
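The general idea of a cache that shuffles data between performance tiers can be sketched in a few lines of Python. To be clear, this is a toy model under our own assumptions, not Microsoft's implementation: the `TieredCache` class, its two tiers, and the least-recently-used eviction policy are all hypothetical stand-ins for whatever the SQL engine actually does.

```python
from collections import OrderedDict


class TieredCache:
    """Toy two-tier cache in front of a slower backing store.

    'fast' stands in for something like NVMe SSD, 'slow' for disk,
    and backing_store for remote object storage. Eviction is plain LRU.
    """

    def __init__(self, backing_store, fast_capacity, slow_capacity):
        self.backing_store = backing_store
        self.fast = OrderedDict()
        self.slow = OrderedDict()
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity

    def get(self, key):
        """Return (value, tier) where tier names where the value was found."""
        if key in self.fast:
            self.fast.move_to_end(key)  # mark as most recently used
            return self.fast[key], "fast"
        if key in self.slow:
            value = self.slow.pop(key)
            source = "slow"
        else:
            value = self.backing_store[key]
            source = "remote"
        self._promote(key, value)
        return value, source

    def _promote(self, key, value):
        """Place a value in the fast tier, demoting older entries as needed."""
        self.fast[key] = value
        if len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value
            if len(self.slow) > self.slow_capacity:
                self.slow.popitem(last=False)  # drop; still in backing store


store = {"a": 1, "b": 2, "c": 3}
cache = TieredCache(store, fast_capacity=1, slow_capacity=1)
for key in ["a", "a", "b", "a"]:
    print(key, cache.get(key))
```

Frequently read data drifts toward the fast tier while cold data falls back to cheaper storage, with no action required from the person running the query.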

Functionally, Azure Synapse Analytics starts by combining Azure Data Factory with Azure SQL Data Warehouse: the former is still available as a standalone service, while Azure Synapse supersedes the latter. And while it does not bundle Power BI or Azure Machine Learning into the same service directly, integrations are built in at the metadata and user interface levels, so the flow is natural.

Azure Synapse uses the concept of a workspace to organize data and code or query artifacts. The workspace can surface as a low-code/no-code tool for business analysts or as a Jupyter-like notebook for data engineers and data scientists to work in Spark or apply machine learning models. In the demos, Microsoft showed how the same data transformation task could be developed using both paths. There will be some differences in the experience. For instance, while Synapse inherits the Azure SQL Data Warehouse capability to support high concurrency, Spark environments have typically involved lone wolf data scientists or data engineers. There are also differences in levels of data security: practice is far more mature on the relational database side, with table, column, and native row-level security, than on the data lake side. That's an area where Cloudera differentiates with SDX, which is available as part of its platform offerings.
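The "same task, two paths" idea from the demos can be illustrated with a small sketch. It is not Synapse code: Python's built-in sqlite3 module stands in for the SQL engine, a plain loop stands in for notebook code, and the `sales` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) pairs.
rows = [("east", 100.0), ("west", 250.0), ("east", 50.0), ("west", 25.0)]


def totals_with_sql(rows):
    """Declarative path: load the rows into a table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )
    result = dict(cursor)  # each result row is a (region, total) pair
    conn.close()
    return result


def totals_with_python(rows):
    """Programmatic path: the same aggregation written as code."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


print(totals_with_sql(rows))
print(totals_with_python(rows))
```

Both functions produce the same totals per region; what differs is who is comfortable writing each one, which is exactly the gap between practitioner groups that a shared workspace is meant to close.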

Owing to the early stage of the Spark feature implementation, Python is currently supported, but R is not there yet. Given Python's momentum, that's probably not a show-stopper for most data scientists.

As this is a highly optimized platform, it's not surprising that Microsoft has added some customizations to its Spark and Jupyter-like Interact notebook implementations, and that not all Spark libraries are currently supported. Without diving into the weeds, Microsoft is looking to a more complete Spark implementation in Azure Synapse once Spark 3.0 comes out. Nonetheless, for data scientists and engineers who want the pure Spark experience, Azure Databricks will remain the better choice.

So, what's on our wish list?

For now, Azure Synapse Analytics operates on the notion of a single data lake composed of relational tables, folders, and files of varying formats. In the future, we would like it to reach out to more data platforms in the Azure portfolio, as we view the data lake as being the collection of data, wherever it sits, in the enterprise. Towards that end, for Spark practitioners, we'd like to see first-party integration with Azure Databricks. There's room to expand supported compute instances, especially for AI workloads requiring GPUs or ASICs. We would also like to see a strategy for hybrid, where Microsoft already has a foot in the door with Azure Stack and Azure Arc. And we'd like to see an Azure Synapse partner program that would provide tight integration and support for third-party tools that could plug into the workspaces.

Oh, and one other thing. Today, Power BI and Azure Machine Learning are treated as ancillary services: as mentioned above, they are integrated into Synapse, but they are not bundled into the service. In the longer run, we believe that both services should be packaged as integral parts of Azure Synapse. We believe that virtually all customers who use Synapse today will also be using self-service visualization, whether it's with Power BI or third-party tools like Tableau. That's not quite the case yet with machine learning, but we expect that to change rapidly, within the next couple of years, as internally developed or prebuilt third-party models become ubiquitous. That's the handwriting on the wall.

This is not Microsoft's first stab at bridging the data warehouse and data lake. For on-premises, there was SQL Server 2019 Big Data Clusters, which placed a SQL Server engine on each node of a Hadoop cluster, making the data lake (as originally defined by clusters with data stored in HDFS) accessible to SQL queries. But Azure Synapse is a complete rethink. More than just making big data available to SQL and Python developers alike, it also changes the development environment by creating "workspaces." It addresses a broader chunk of the analytics lifecycle, from data ingestion, transformation, and integration all the way through self-service visualization and even collaboration by embedding Power BI reports into Microsoft Teams.

But more to the point, Azure Synapse reflects the fact that in the cloud, providers can break down the silos in the toolchain to present more unified offerings covering more of the lifecycle. Microsoft is hardly the only provider heading down this path. SAP Data Warehouse Cloud is taking a similar approach by integrating SAP Analytics Cloud to provide the self-service visualization last mile, while Oracle has begun publicly talking about extending the Autonomous Data Warehouse into a broader platform offering that, like Azure Synapse, would encompass more of the lifecycle (we expect Oracle Analytics Cloud integration to become a core component). So now we're waiting for the next shoes to drop from AWS and Google.