Databricks rolls out data sharing, automated pipelines, data catalog

At its Data + AI Summit, Databricks rolls out its new Delta Sharing, Delta Live Tables and Unity Catalog initiatives. For now, only Delta Sharing is open source, as the company looks to fill out its platform with all the bells and whistles.

Written by Andrew Brust, Contributor May 26, 2021 at 8:30 a.m. PT

The name of Databricks' annual conference has gone from "Spark Summit" to "Spark + AI Summit" and now to "Data + AI Summit." The evolution of the event name tracks Databricks' own transition from the Spark company, to the AI on Spark company, to what we might now call the "Delta Lakehouse" company. In testament to that, this year, at the event's second virtual incarnation, the company is rolling out a new open source project called Delta Sharing; a new proprietary SQL-based data pipeline platform called Delta Live Tables; and the new home-grown/proprietary Unity Catalog, for data cataloging needs.

Also read: Is Databricks vying for a full analytics stack?

ZDNet was briefed on all this and more by none other than Databricks CEO, Ali Ghodsi. The briefing, which was chock full of technical detail, demonstrated well how the company has been spending some of the $1B it raised in February in order to shore up its offering. The Databricks Unified Data Analytics Platform now runs on all three major public clouds, and features SQL analytics, data engineering/data science, governance, MLOps, pipelines, and data sharing, on top of an ACID-compliant data lake with an optimized query engine. Now, Databricks, the company founded by the creators of Apache Spark, seems most excited about building up its Delta brand.

Thanks for sharing

Of the three pieces being announced today, Delta Sharing may have the most industry impact. It's an open standard for sharing data files in Parquet and Delta Lake formats (Databricks doesn't explicitly mention other formats) with internal users and external partners. Databricks says that Delta Sharing works in a fashion "completely independent of the platform on which the data resides." While sharing data files is of course possible with older common protocols, like FTP and even HTTP, Delta Sharing is governed, with what Databricks' press release says are "built-in security controls and easy-to-manage permissions." It's also an open source technology which, like other such recent projects from Databricks (think MLflow and Delta Lake), is being donated to the Linux Foundation.

The graphic Ghodsi shared with me when discussing Delta Sharing showed logos for several open source projects including Trino, Presto and Hive; BI products including Microsoft Power BI, Tableau, Qlik, and Google's Looker; an array of industry data management and analytics vendors including Alation, Collibra, Dremio and AtScale; as well as data providers that include FactSet, Precisely and Foursquare. Other logos included those of Google BigQuery and Microsoft Azure. The last of these is notable since Microsoft already brought its own Azure Data Share (ADS) offering to market almost two years ago. It all comes together though; Ghodsi explained to me that ADS will now be compatible with Delta Sharing, which will open it up to more non-Azure data sources and, ostensibly, non-Azure customers, as well.

Also read: Microsoft looks to 'do for data sharing what open source did for code'

I read it on the pipeline

Now let's move on to Delta Live Tables. Like the Delta Engine component, this one is not open source, at least not yet. Ghodsi described Live Tables as a system for ETL (extract, transform and load) pipelines, but with a couple of twists. First of all, unlike the pipelines Databricks customers could already hand-code in a notebook, Live Tables is based completely on declarative statements. Ghodsi explained that these are SQL-based; the press release states that Live Tables use "high-level languages like SQL." Regardless, it's clear that imperative coding in Python, R or Scala is not required.

Second, because of Live Tables' declarative paradigm, the query optimizer that's already part of Delta Engine and Photon can in fact optimize these pipelines and even bundle them into efficient directed acyclic graph (DAG) execution packages. So not only is Live Tables a system for authoring, managing and scheduling pipelines with built-in error handling and restart capability, but it can essentially precompile those pipelines and optimize their execution.
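
To make the DAG idea concrete, here is a minimal sketch in Python of how a declarative pipeline description can be turned into an execution plan. It is an illustration only, not Databricks' implementation or API: the table names, the PIPELINE dictionary and both helper functions are hypothetical, and the only library used is Python's standard graphlib module (Python 3.9 or later). Each table declares which tables it reads from; because nothing in the declaration says how or when to run, a planner is free to work out a valid order and to spot steps that could run side by side.

```python
from graphlib import TopologicalSorter

# Hypothetical declarative pipeline: each table lists the tables it reads from.
# These names are invented for illustration and do not come from Databricks.
PIPELINE = {
    "raw_orders": [],
    "raw_customers": [],
    "clean_orders": ["raw_orders"],
    "orders_by_customer": ["clean_orders", "raw_customers"],
}


def execution_order(pipeline):
    """Return table names ordered so that every table follows its inputs."""
    return list(TopologicalSorter(pipeline).static_order())


def execution_stages(pipeline):
    """Group tables into stages; tables within a stage do not depend on each other."""
    sorter = TopologicalSorter(pipeline)
    sorter.prepare()  # raises graphlib.CycleError if the declarations are circular
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(execution_stages(PIPELINE), start=1):
        print(f"stage {number}: {', '.join(stage)}")
```

In this toy example the two raw tables land in the first stage, the cleaned table in the second and the joined table in the third. A real engine would go much further, for instance by optimizing the queries inside each step, but the ordering step shows why a declarative description is easier to plan than hand-written imperative code.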

Order from a catalog

Of course, to govern a data lake, track permissions on data sets for sharing and know the metadata needed to optimize pipelines, you really need a data catalog, too. And while Apache Atlas and Ranger are already out there providing a standard for this, Databricks has built its own, called Unity Catalog. The "UC" acronym for the product, Ghodsi intimated, is a nod to University of California, Berkeley's AMPLab, where Databricks' founders met and collaborated on what would become Apache Spark.

Another reason for the "Unity" name, though, is that the catalog tracks not just tables and files but also views, dashboards and machine learning models. Ghodsi says it is all underpinned by Delta Sharing and implements attribute-based access controls (ABAC). Despite its go-its-own-way manifestation, Unity is compatible with other, existing data catalog platforms.

A house by the lake

Databricks is, obviously, pushing its data lakehouse model hard, and it's respectably building out its platform to support those efforts. Ghodsi told me that the platform's SQL Analytics workspaces, which were announced in November of last year and have been in a gated preview ever since, have seen significant performance improvements in the interim, and are entering an open public preview the first week in June, on Azure and Amazon Web Services, with Google Cloud to follow shortly afterwards.

Also read: Databricks launches SQL Analytics

To show the world of data warehouse aficionados how seriously it should take the lakehouse model, Databricks is going so far as to bring data warehouse pioneer Bill Inmon on stage, at the Summit, to share his enthusiasm for the lakehouse model. Whether the data warehouse community -- especially that part of it that follows Ralph Kimball more than Inmon -- will be convinced of the lakehouse's efficacy remains to be seen. It's quite possible that the warehouse and lake/lakehouse may eventually coexist. Meanwhile, right now, it's fun to watch the two sides compete, innovate and make their respective cases.