Azure Synapse Analytics data lake features: up close
Home screen
Though still in public preview, Azure Synapse Analytics has added a slew of new data lake features, based on Apache Spark, to the platform.
But it's much more than that. With Synapse Studio, Synapse Analytics' browser-based development environment, a broad set of capabilities come together. With the help of this tool, Synapse combines not only data warehouse and data lake, but also data engineering and data science; BI and AI; cluster computing and serverless computing; T-SQL and Spark SQL; notebooks and scripts; and Python, Scala and C#.
I created this gallery for two purposes: as a show-and-tell to help readers understand the public preview features in the service, and as a way to structure my own learning and understanding of them.
All of the code and work here is based on examples from Microsoft, but the hands-on work and screenshots are original.
Home screen new object menu
From the home screen, you can directly create any of the major assets Synapse Studio lets you author. These include SQL scripts (against the warehouse or the lake), Jupyter notebooks (in a customized Synapse Studio experience), Spark batch job definitions, Azure Data Factory pipelines and Mapping Dataflows, and Power BI reports. You can also import (read: upload) existing scripts and notebooks.
Left navbar
The left navbar in Synapse Studio provides a good overview of the tool's capabilities, as well as those of the platform overall. With buttons for browsing data, doing development work, creating pipeline orchestrations, monitoring ongoing processes and managing assets (like server pools, linked services and orchestration triggers), Synapse brings together an enormous number of capabilities. And there's integration with Power BI and Azure Machine Learning, too.
Click through all the slides in this gallery for details.
Monitoring your SQL pools
From the Synapse Studio "Manage" screen, it's easy to see a list of all your Synapse SQL pools, including the serverless one for SQL on-demand. You can pause, create and delete pools here too, of course.
Databases view provides unified view of warehouse and lake data
Clicking the Data button on the left navigation bar displays all your data in Synapse, whether it be in a data warehouse (SQL pool) or the data lake (Spark). You can also create and open a starter notebook that queries one of your data lake tables by right-clicking it, then choosing "New notebook" and "Load to DataFrame", as shown here.
SQL on-demand
Synapse's SQL on-demand lets you query data in your data lake (be it in Spark databases or just sitting as a file in Azure storage) using the same Transact SQL (T-SQL) language used for SQL pools, Azure SQL Database and SQL Server.
The SQL script editor lets you run queries and view the result sets in tabular format or using the same native data visualizations available in Synapse notebooks. A query against data from the Twitter API, and a visualization of it, are shown here.
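To make the SQL on-demand idea concrete, here's a sketch of the kind of T-SQL it runs directly against files in the lake, using OPENROWSET. The storage account, container and file path are hypothetical placeholders; the query would normally be typed straight into the SQL script editor, though it can also be submitted from client code over the SQL on-demand endpoint.

```python
# Hedged sketch: a T-SQL query that SQL on-demand can run directly against
# Parquet files sitting in Azure storage. The storage URL is a placeholder.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/tweets/*.parquet',
    FORMAT = 'PARQUET'
) AS tweets;
"""

# From client code, the same query could be submitted with any T-SQL driver,
# e.g. pyodbc (connection string to the SQL on-demand endpoint omitted):
# import pyodbc
# with pyodbc.connect(conn_str) as cn:
#     for row in cn.execute(query):
#         print(row)
```

The same OPENROWSET pattern works for delimited text files by changing the FORMAT argument.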
Obviously, querying the data warehouse with T-SQL works too. Just pick your SQL pool (instead of "SQL on-demand") from the "Connect to" drop-down list.
Notebook
Another way to query your data is with code in notebooks. Synapse notebooks let you mix markdown text, code (in several languages) and data visualizations in one user interface. Python code and visualization output from the popular matplotlib library are shown here.
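As a self-contained illustration of the kind of cell shown in the slide, the snippet below aggregates some sample data in Python and charts it with matplotlib. The numbers are made up for illustration; in a Synapse notebook the chart renders inline below the cell.

```python
# Minimal example of the sort of cell a Synapse notebook runs: compute a
# small aggregate in Python, then chart it with matplotlib. Sample data only.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; notebooks render charts inline
import matplotlib.pyplot as plt

tweets_per_hour = {"08:00": 120, "09:00": 340, "10:00": 275, "11:00": 410}

fig, ax = plt.subplots()
ax.bar(tweets_per_hour.keys(), tweets_per_hour.values())
ax.set_xlabel("Hour")
ax.set_ylabel("Tweet count")
ax.set_title("Tweets per hour (sample data)")
fig.savefig("tweets_per_hour.png")
```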
Notebook Options
Synapse Analytics notebooks provide a customized user interface skin over standard Jupyter notebooks. They can use Python, Scala, Spark SQL or even .NET/C# language kernels (C# is highlighted, although the code shown is in Python). Regardless of the default language for a notebook, any given cell within it can contain code in any one of the supported languages.
Notebooks can connect to any Synapse Spark pool, by choosing it from the "Attach to" drop-down list highlighted in this figure. Note the highlighted button at the top-right will create a new pipeline and add the notebook to it.
.NET Dataframes and UDFs
You can query the data lake in C# (highlighted in the Language dropdown list) using the .NET DataFrame library. You can even create user defined functions in C# and reference them when the DataFrame is created, as illustrated by the highlighted source code in this slide.
Data Factory pipelines
You can create Azure Data Factory (ADF) pipelines in Synapse Analytics too. Using Synapse Studio integration, it's easy to create a simple pipeline that will run the code in a notebook on a scheduled basis. To do so, just drag and drop the notebook from the object explorer onto the pipeline design canvas, as shown here, and then create a trigger by clicking the highlighted "Trigger" toolbar button. Click the "Publish All" toolbar button, highlighted at the top-left of the screen, to save and register the pipeline and trigger.
While not shown here, you can create ADF mapping data flows, and use ADF's Copy Data tool, too.
New linked service
You can connect to data in myriad other platforms (including competing ones, like Amazon Redshift, selected here) by creating a Synapse Analytics "linked service." If you link to a Power BI workspace, you can create Power BI reports against data in Synapse, and even edit the report in Synapse Studio. (Report creation and editing are shown in the next two slides.)
Create a Power BI dataset from Synapse Studio
You can create a dataset in the Power BI cloud service, pointing to your data warehouse, directly in Synapse Studio. Simply hover over your SQL pool in the list and download a special .pbids file, which contains all the connection information you'll need. Next, double-click the file to create a new report against your warehouse in Power BI Desktop. Once you publish the report, the dataset will be created, and you'll be able to edit the report further right in Synapse Studio.
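For the curious, the .pbids file is just a small JSON document that points Power BI Desktop at the warehouse. A representative example looks roughly like this (the server address and database name are placeholders):

```json
{
  "version": "0.1",
  "connections": [
    {
      "details": {
        "protocol": "tds",
        "address": "myworkspace.sql.azuresynapse.net",
        "database": "mysqlpool"
      },
      "mode": "DirectQuery"
    }
  ]
}
```

Because the connection details travel in the file, double-clicking it opens Power BI Desktop already pointed at the right server and database.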
Editing a Power BI Report
Here's the Power BI editing experience. Put simply, the Power BI Web interface is hosted within Synapse Studio. But Power BI also gets its own node in Synapse Studio's Develop screen, along with scripts, notebooks and Spark job definitions.
Azure ML integration, via code
Beyond Power BI, Synapse Analytics also integrates with Azure Machine Learning (Azure ML). For now, though, that integration is only through code. Not that there's anything wrong with that: Synapse's affinity (and Spark's) for Python code, plus the availability of the Azure ML SDK for Python, mean it all fits together.
Here, you can see code in a Synapse Analytics notebook that uses the Azure ML SDK to perform an AutoML experiment. Notice the console output from Azure ML streams back into the notebook cell, shown in the bottom half of this slide. The Synapse Spark cluster itself is used for the compute.
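A hedged sketch of what such a notebook cell might look like is below. It assumes the Azure ML SDK for Python is installed on the Spark pool; the workspace, experiment and column names are placeholders, and the SDK calls themselves are shown as comments since they require live Azure credentials.

```python
# Hedged sketch: submitting an AutoML experiment from a Synapse notebook via
# the Azure ML SDK for Python. Names below are illustrative placeholders.
automl_settings = {
    "task": "classification",
    "primary_metric": "AUC_weighted",
    "iterations": 10,
}

# The actual submission (requires azureml-train-automl and Azure credentials):
# from azureml.core import Workspace, Experiment
# from azureml.train.automl import AutoMLConfig
#
# ws = Workspace.get(name="my-aml-workspace",
#                    subscription_id="...", resource_group="...")
# config = AutoMLConfig(training_data=train_df,
#                       label_column_name="label",
#                       **automl_settings)
# run = Experiment(ws, "synapse-automl-demo").submit(config, show_output=True)
```

With `show_output=True`, Azure ML's console output streams back into the notebook cell, which is what the bottom half of the slide shows.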
Create a Spark batch job
In addition to notebooks and SQL on-demand scripts, you can run a job on your Synapse Spark cluster based on a Python script, a Scala jar file or a zipped-up .NET Core package. You do this by creating a new Spark job definition in the Synapse Studio's Develop screen. Here we are creating a job that uses a Python script to run the classic wordcount algorithm on a file containing text from Shakespeare.
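The core of the wordcount algorithm the job runs can be sketched in a few lines of plain Python; the script actually uploaded to Synapse would typically express the same map/reduce logic with PySpark's RDD API, and the sample text here stands in for the Shakespeare file read from ADLS.

```python
from collections import Counter

# Stand-in for the Shakespeare text file the Spark job would read from ADLS.
text = "to be or not to be that is the question"

# The same logic a distributed wordcount performs: split into words,
# emit (word, 1) pairs, and sum the counts per word.
counts = Counter(text.lower().split())

for word, n in counts.most_common(3):
    print(word, n)  # 'to' and 'be' each appear twice
```

In the Spark version, the split and sum steps run in parallel across partitions of the input file, and the summed pairs are what land in the output sequence files shown later in this gallery.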
The highlighted Submit button actually runs the job. After the job has been successfully submitted, a hyperlink (also highlighted) appears; click it to monitor the job's progress in Synapse Studio's Monitor screen. You can see this in the next slide.
Spark job monitoring
Use Synapse Studio's Monitor screen to watch (or even "replay") the status of a Spark job. Note the highlighted "Spark history server" toolbar button that allows you to monitor the job in the Spark UI instead. Advance to the next slide in this gallery to see that.
Monitoring a job via the Spark UI
After clicking the "Spark history server" button from Synapse Studio's Monitor screen, the standard Spark UI appears. You can drill down on the tasks within your Spark job, open a specific task and further drill down on its event timeline, as we've done here.
Inspecting Spark job output
It's easy to see the results of your Spark job by viewing the contents of its Azure Data Lake Storage (ADLS) output folder.
For our wordcount job, we can use the ADLS blade in the Azure portal to view the output folder, the path to which is highlighted. Also highlighted are the zero-byte "_SUCCESS" file, and two output sequence files. Partial output from the first sequence file is displayed and highlighted in the middle of the screen.
Now create your own workspace
Ready to try this yourself? The new Synapse Analytics workspaces are in public preview and ready for you to provision. Just head over to the Azure portal, create a new resource, search for "Synapse" and click on "Azure Synapse Analytics (workspaces preview)" to bring up the create screen shown here. Then go to Microsoft's "Get Started with Azure Synapse Analytics" page to get going.