Databricks is the company founded by the creators of Apache Spark. Its Unified Data Analytics Platform (UDAP) is a cloud-based, managed and optimized Spark service available directly from Databricks on the Amazon Web Services cloud or as a Microsoft-supported service on that company's Azure cloud. More recently, Databricks has added new capabilities to UDAP that go beyond Spark, notebooks and its other basics. Today, at Spark+AI Summit Europe in Amsterdam, Databricks is making announcements around two of these technologies: MLflow and Delta Lake.
Go with the (ML) flow
MLflow is an open source project from Databricks that's integrated into UDAP but can also be used with other platforms. MLflow helps with machine learning experiment and model management, allowing the logging of different algorithm and hyperparameter configurations, along with the accuracy of the models they are used to produce. MLflow also defines a model persistence format that makes the models shareable. And today, to build on that, Databricks is announcing the addition of the MLflow Model Registry and a private preview release of it, integrated into UDAP.
MLflow Model Registry offers a combination of discoverability, operational control and governance of machine learning models within an organization or community. And because MLflow is not proprietary to Databricks, it also means that models built on non-Spark platforms could be more easily used from Spark, and vice-versa. The registry enables governance of models by tracking their history and managing who can approve changes. It can also manage the movement of models from experimentation to testing, and to production, either through policy-based automation or manual movement by authorized MLflow users.
MLflow provides all of this manageability in a space that otherwise has only a smattering of commercial, proprietary tooling to help with the problem of managing huge arrays of models and model experiments. Perhaps that's why, according to Databricks, MLflow is racking up 800,000 monthly downloads, and growing. Clearly, it's filling a need, and the addition of Model Registry will round out its capabilities nicely.
Delta Lake, explained

Delta Lake offers a layer on top of Spark SQL and the Parquet files stored in the Databricks File System. Through the use of difference (delta!) files and special indexes, Databricks has added important capabilities to its data lake stack that make updates both high-performing and, like a conventional relational database, transactional and ACID-compliant. This means that new data can be added to the lake, then queried immediately and efficiently, addressing a key data lake pain point.
Initially, Delta Lake was a proprietary feature called Databricks Delta, and was unavailable to the wider Spark ecosystem. Subsequently, Databricks announced that it would open source Delta Lake, both the file format and protocol and the Spark SQL implementation of them. The source code was placed on GitHub, increasing the potential for Delta Lake to become a widely adopted open standard.
Linux Foundation governance of Delta Lake
Today, however, Databricks is taking this a step further, announcing that it is transferring governance of the open source Delta Lake project to The Linux Foundation. Doing so, the company says, will better enable contributions to the project from outside Databricks. While a self-managed open source project does allow individual contributors to participate, accommodating contributions from full-fledged external teams requires more formal governance. Delta Lake governance by The Linux Foundation will, Databricks feels, allow such broad participation to flourish.
And there's good reason for this. It turns out that several prominent companies are interested in participating. Intel has a stated interest in contributing to the Spark SQL implementation of Delta Lake that Databricks built. Meanwhile, Starburst, Alibaba and Booz Allen Hamilton are interested in developing new implementations of Delta Lake, over Presto, Apache Hive and Apache NiFi, respectively. Such broad implementation of Delta Lake would go a long way toward making it a de facto standard in the data lake stack. This would strengthen the data lake model overall and make Delta Lake a commonly available technology across clouds and engines.
Between MLflow and Delta Lake, on top of Apache Spark itself, Databricks has contributed mightily to the open source analytics and machine learning world, and continues to do so. It will be interesting to see what halo effect this has on UDAP. No matter what, though, the company and its founders are great open source citizens.