
Databricks comes to Microsoft Azure

The premium implementation of Apache Spark, from the company established by the project's founders, comes to Microsoft's Azure cloud platform as a public preview.

Databricks, the company founded by the creators of Apache Spark, first launched its cloud-based Spark services to general availability in 2015. It was a single cloud offering, from Databricks itself, but physically based on the Amazon Web Services cloud.

On the Azure side, meanwhile, there have been several ways to run Apache Spark, including on HDInsight, Azure Batch Service, Data Science Virtual Machines and, more recently, Azure Machine Learning services. But if you wanted full-on Databricks, you had to do that on AWS.

Redmond-bound
Enter Azure Databricks (ADB), a new flavor of the premium Apache Spark service but this time based on, and tightly integrated with, Microsoft Azure. ADB has direct support for Azure Blob Storage and Azure Data Lake Store, and its otherwise standard documentation has been customized to illustrate how to connect to Azure SQL Database and SQL Data Warehouse, and to connect to the service from Power BI. It also integrates with Cosmos DB and Azure Active Directory.
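
To give a feel for what direct Blob Storage support means in practice, here is a minimal sketch in Python. It assumes the usual Hadoop-style `wasbs://` addressing for Azure Blob Storage; the helper functions and the account, container and file names are hypothetical placeholders, not part of any Databricks API.

```python
def wasbs_uri(account: str, container: str, path: str = "") -> str:
    """Build a wasbs:// URI for a blob path (hypothetical helper)."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def account_key_setting(account: str) -> str:
    """Name of the Spark/Hadoop setting that holds the storage account key."""
    return f"fs.azure.account.key.{account}.blob.core.windows.net"


# Placeholder names, for illustration only.
uri = wasbs_uri("examplestore", "logs", "/2017/11/events.csv")
print(uri)

# Inside a notebook attached to a running cluster, where a `spark` session
# is already defined, the values would typically be used along these lines:
#   spark.conf.set(account_key_setting("examplestore"), "<storage-account-key>")
#   df = spark.read.csv(uri, header=True)
```

The last two lines are left as comments because they only make sense on a live cluster; everything above them runs anywhere Python does.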

The integration is so tight that although the Databricks product itself comes from a third party, the service is in fact a first-party offering from Microsoft. So rather than procuring it via the marketplace, you provision it as you would any other Azure-branded service, and Azure's enterprise-grade SLAs apply to it.

How Databricks fits in with the overall Azure data stack

Credit: Microsoft

Azure Databricks features a notebook-based collaborative workspace (details of which are discussed below), the Databricks Runtime (a highly optimized version of Apache Spark), and a serverless compute model, which avoids the detailed configuration work normally associated with managing Spark.

Azure Databricks is different from other Spark implementations because the environment itself is decoupled from any instantiated Spark cluster. Rather than firing up and paying for cluster resources before getting any work done, you have a design-time experience within a Databricks workspace and start up a cluster only when you are ready to execute the work.

Azure Databricks workspace dashboard

Credit: Microsoft

Take out your notebooks
Much of that work gets done in Databricks notebooks. These are similar in concept to Jupyter notebooks, which can in fact be imported into Databricks (I did this myself and can confirm that it works); Databricks notebooks can likewise be exported to Jupyter format.

Databricks notebooks can be used and shared collaboratively and may contain code in any combination of supported languages, including Python, Scala, R and SQL, as well as markdown text used to annotate the notebook's contents.

The code cells (sections) of the notebooks can be executed interactively. When notebook code (especially SQL queries) returns tabular results, these can be visualized as charts. A notebook with a number of charts and some markdown can alternatively be rendered as a dashboard.

But notebooks can also be considered production executable packages. Notebooks can reference and run other notebooks, and they can also be run as full-fledged jobs, on a scheduled basis. And when such jobs are run, the Spark clusters needed to run them can be created on the fly, then terminated.
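
The create, run, terminate lifecycle described above can be sketched in a few lines of plain Python. This is a simulation under stated assumptions, not Databricks code: `FakeCluster`, `job_cluster` and `run_job` are hypothetical names, and the notebook paths are made up. The point is simply that the cluster exists only for the duration of the job and is terminated even if a notebook fails.

```python
from contextlib import contextmanager


class FakeCluster:
    """Stand-in for a Spark cluster; it only records what happened to it."""

    def __init__(self, name):
        self.name = name
        self.running = True
        self.log = []

    def run_notebook(self, path):
        if not self.running:
            raise RuntimeError("cluster is terminated")
        self.log.append(path)
        return f"ran {path}"

    def terminate(self):
        self.running = False


@contextmanager
def job_cluster(name):
    """Create a cluster for one job and always terminate it afterwards."""
    cluster = FakeCluster(name)
    try:
        yield cluster
    finally:
        cluster.terminate()


def run_job(notebook_paths):
    """Run each notebook in order on a cluster that lives only for this job."""
    with job_cluster("nightly-job") as cluster:
        results = [cluster.run_notebook(path) for path in notebook_paths]
    return cluster, results


cluster, results = run_job(["/etl/load", "/etl/report"])
print(results)
print("still running?", cluster.running)
```

A scheduler would call something like `run_job` at each scheduled time, so no cluster sits idle between runs.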

Cluster types
Clusters can also be explicitly created, which is necessary for doing interactive work against Spark. Standard clusters allow for a great deal of customization in their configuration, including the virtual machine (VM) type of driver and worker nodes; the number of worker nodes deployed and whether auto-scaling will be used to adjust it; the versions of Databricks, Spark and Scala deployed; and an inactivity timeout after which the cluster will be automatically terminated.
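
As a rough illustration of how many knobs that is, the settings listed above can be modeled as a small Python data structure. The class and field names here are hypothetical and chosen for readability; they are not the field names of the Databricks API, and the runtime version and VM size are placeholder values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StandardClusterSpec:
    """Hypothetical container for the Standard cluster settings listed above."""
    name: str
    runtime_version: str           # selects the Databricks, Spark and Scala versions
    driver_vm_type: str
    worker_vm_type: str
    num_workers: int = 2           # used only when autoscale is None
    autoscale: Optional[Tuple[int, int]] = None   # (min_workers, max_workers)
    inactivity_timeout_minutes: int = 120

    def validate(self) -> None:
        if self.autoscale is not None:
            low, high = self.autoscale
            if not 1 <= low <= high:
                raise ValueError("autoscale needs 1 <= min <= max")
        elif self.num_workers < 1:
            raise ValueError("a fixed-size cluster needs at least one worker")
        if self.inactivity_timeout_minutes <= 0:
            raise ValueError("inactivity timeout must be positive")

    def to_dict(self) -> dict:
        """Return a plain dict, e.g. for logging or serializing the settings."""
        self.validate()
        if self.autoscale is not None:
            workers = {"min": self.autoscale[0], "max": self.autoscale[1]}
        else:
            workers = {"fixed": self.num_workers}
        return {
            "name": self.name,
            "runtime_version": self.runtime_version,
            "driver_vm_type": self.driver_vm_type,
            "worker_vm_type": self.worker_vm_type,
            "workers": workers,
            "inactivity_timeout_minutes": self.inactivity_timeout_minutes,
        }


# Placeholder values, for illustration only.
spec = StandardClusterSpec(
    name="interactive-analysis",
    runtime_version="example-runtime",
    driver_vm_type="Standard_DS3_v2",
    worker_vm_type="Standard_DS3_v2",
    autoscale=(2, 8),
    inactivity_timeout_minutes=60,
)
print(spec.to_dict())
```

Treating the worker count and the auto-scaling range as alternatives mirrors the choice described above: either a fixed number of workers or a range the service adjusts within.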

A "serverless pool" can be created instead. Despite the seeming contradiction in terms, a serverless pool is still a cluster, but its configuration is handled automatically and the user need only specify its name and the VM type for, and number of, worker nodes. Serverless pools are in beta and are designed for running Python and SQL code interactively from notebooks. Production notebooks, or any notebook with Scala or R code, should be run on Standard clusters instead.
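
That rule of thumb is simple enough to write down as code. The function below is a hypothetical helper that encodes only the guidance in the paragraph above; it is not something Databricks provides.

```python
SERVERLESS_LANGUAGES = {"python", "sql"}


def recommended_cluster_type(languages, production=False):
    """Pick a cluster type from a notebook's code languages and its role."""
    used = {language.lower() for language in languages}
    if production or not used <= SERVERLESS_LANGUAGES:
        return "standard"
    return "serverless pool"


print(recommended_cluster_type(["Python", "SQL"]))
print(recommended_cluster_type(["Python", "Scala"]))
print(recommended_cluster_type(["SQL"], production=True))
```

Only an interactive notebook that sticks to Python and SQL lands on a serverless pool; anything else falls back to a Standard cluster.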

It's the platform, stupid
Databricks sells itself not as another flavor of Spark but as the Unified Analytics Platform: a collaborative platform for data prep, analytics and machine learning/AI that happens to be powered by a commercial, optimized version of Spark. Azure Databricks' deep integration with so many facets of the Azure cloud, and its support for notebooks that live independently of a provisioned and running Spark cluster, seem to bear that out.

You can almost look at Azure Databricks as a data engineer's abstraction layer over a huge chunk of the Azure cloud itself. The fundamental elements of its environment, namely a workspace with notebooks, databases, clusters and jobs, bring some order to both the Azure cloud and Spark's own SQL, streaming, machine learning and graph processing sub-components.