Databricks unveils new open source project to clean up data lakes

Delta Lake is a new open source project that could help data scientists and data engineers untangle their batch and streaming data pipelines by adding a transaction layer.


In our 2018 year-ahead predictions, we forecast that cloud storage would become the de facto data lake. The dilemma is that cloud storage was designed for just that – storage. But increasingly, business analysts and data scientists want access to that data. With Athena, AWS made data in S3 queryable. ChaosSearch turned Amazon S3 storage into a de facto Elasticsearch cluster. Cloud data warehouses extended their reach to query cloud storage, while most cloud managed Hadoop services use that storage layer as the default option.

The challenge, of course, is that data pouring into cloud storage tends to land there by default. Guess what? In those scenarios, good things like governance or tracking of data lineage end up inconsistently applied, if at all. Admittedly, the losses might seem trivial if the purpose is simply to explore data before conducting the analytic runs on which decisions are made. The drawback with that rationale is that, in an era of GDPR, enterprises might get into trouble storing data to which they are not entitled. Then there are the perennial data validation issues that occur when you have multiple, conflicting versions of the truth. It can throw data science or machine learning projects off kilter. In the data lake era, "garbage in, garbage out" is hardly obsolete.

In the run-up to Spark + AI Summit, Databricks is unveiling a new open source project, Delta Lake, which has nothing to do with the bayou or harvesting crawfish. It makes data processed with Spark transactional and lands it in the common Parquet format. Delta Lake, which is available under an Apache 2.0 open source license, applies an ACID transaction layer that bolts onto Spark data pipelines to ensure that data updates arriving by stream and/or batch won't trip over each other and produce corrupted commits, whether partial or duplicate. If undifferentiated cloud storage is the de facto data lake, Delta Lake aims to provide a clean landing zone.

Having transactional support means that data engineers or developers won't have to build a separate layer to ensure consistent updates. That matters because data lakes typically have multiple data pipelines reading and writing data concurrently. Databases developed transaction support to make data commits clean; until now, data lakes lacked such mechanisms, forcing data engineers or developers to write their own transaction logic. In most cases, doing nothing was the default option given the alternative of laborious and hard-to-maintain custom development.
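To make the idea of a transaction layer over plain files concrete, here is a minimal sketch in Python. It is a toy, not Delta Lake's implementation: the `ToyTable` class, its `_log` directory, and the `fail_before_commit` flag are all hypothetical names invented for illustration. The sketch stages data in a file that readers do not know about, then publishes it by creating the next numbered entry in a commit log. A writer that dies before that step leaves no visible trace, and two writers cannot claim the same version number.

```python
import json
import os
import tempfile


class ToyTable:
    """Toy table (hypothetical): data files plus an ordered commit log."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def latest_version(self):
        # Log entries are named 00000000.json, 00000001.json, ...
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, rows, fail_before_commit=False):
        # Step 1: stage the data in a file that no log entry points to yet.
        fd, data_path = tempfile.mkstemp(dir=self.root, suffix=".data.json")
        with os.fdopen(fd, "w") as f:
            json.dump(rows, f)
        if fail_before_commit:
            # Simulated crash: the staged file is orphaned and never read.
            raise RuntimeError("writer crashed before commit")
        # Step 2: publish by creating the next log entry. Mode "x" raises
        # FileExistsError if another writer already took that version,
        # in which case we simply try the next one.
        while True:
            version = self.latest_version() + 1
            entry = os.path.join(self.log_dir, f"{version:08d}.json")
            try:
                with open(entry, "x") as f:
                    json.dump({"add": os.path.basename(data_path)}, f)
                return version
            except FileExistsError:
                continue

    def read(self):
        # Readers follow the log in order, so they only see committed data.
        rows = []
        for name in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, name)) as f:
                data_file = json.load(f)["add"]
            with open(os.path.join(self.root, data_file)) as f:
                rows.extend(json.load(f))
        return rows
```

In this sketch, a call such as `ToyTable(tempfile.mkdtemp()).commit([{"id": 1}])` returns the new version number, while a commit that fails partway leaves `read()` unchanged. A real system would also have to deal with conflicting updates, cleanup of orphaned files, and the guarantees of the underlying storage, all of which this toy ignores.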

Delta Lake lets you enforce schema if you choose, a concept more often associated with relational databases than with data lakes. It also provides snapshots so developers can access or revert to earlier versions of the data. That is useful not only for audits, but also for testing the validity of any model. As it is fully Spark-compatible, it will plug into existing Spark data pipelines.
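Schema enforcement and snapshots are easier to picture with a small example. The following Python sketch is an in-memory toy under stated assumptions, not Delta Lake's API: `VersionedTable`, `append`, and `read` are hypothetical names. It rejects a whole batch if any row violates the declared schema, and it keeps every committed version so that an earlier snapshot can still be read.

```python
class VersionedTable:
    """Toy in-memory table (hypothetical) with schema checks and snapshots."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self._versions = [[]]       # version 0 is the empty table

    def append(self, rows):
        # Validate the entire batch first, so a bad row rejects all of it.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(
                    f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
                )
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise ValueError(f"column {col!r} expects {typ.__name__}")
        # Each successful append creates a new immutable snapshot.
        self._versions.append(self._versions[-1] + [dict(r) for r in rows])
        return len(self._versions) - 1

    def read(self, version=None):
        # Default to the latest snapshot; otherwise read the one requested.
        snapshot = self._versions[-1 if version is None else version]
        return [dict(r) for r in snapshot]
```

Here, appending a batch in which one row has a string where an integer is expected raises `ValueError` and leaves the table untouched, while `read(version=1)` keeps returning the table as it stood after the first commit, which is the property that makes audits and model validation repeatable.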

With Delta Lake, Databricks is banking on the fact that ACID won't pollute lakes, but cleanse them.