In theory, data lakes sound like a good idea: one big repository to store all the data your organization needs to process, unifying a myriad of data sources. In practice, most data lakes are a mess in one way or another, earning them the "data swamp" moniker. Databricks says part of the reason is a lack of transactional support, and it has just open sourced Delta Lake, a solution to address this.
Historically, data lakes have been a euphemism for Hadoop. Historical Hadoop, that is: On-premises, using HDFS as the storage layer. The reason is simple. HDFS offers cost-efficient, reliable storage for data of all shapes and sizes, and Hadoop's ecosystem offers an array of processing options for that data.
The data times are a changin' though, and data lakes follow. The main idea of having one big data store for everything remains, but that store is not necessarily on-premises anymore, and not necessarily Hadoop either. Hadoop itself is evolving to utilize cloud storage and work in the cloud.
A layer on top of your storage system, wherever it may be
Databricks is the company founded by the creators of Apache Spark. Spark has complemented, or superseded, traditional Hadoop to a large extent. This is due to the higher abstraction of Spark's APIs and its faster, in-memory processing. Databricks itself offers a managed version of open source Spark in the cloud, with a number of proprietary extensions, called Delta. Delta is cloud-only, and is used by a number of big clients worldwide.
In a conversation with ZDNet, Matei Zaharia, Apache Spark co-creator and Databricks CTO, noted that sometimes Spark users migrate to the Databricks platform, while other times it's line-of-business requirements that dictate a cloud-first approach. It seems that having to deal with data lakes that span on-premises and cloud storage prompted Databricks to address one of their main issues: reliability.
"Today nearly every company has a data lake they are trying to gain insights from, but data lakes have proven to lack data reliability. Delta Lake has eliminated these challenges for hundreds of enterprises. By making Delta Lake open source, developers will be able to easily build reliable data lakes and turn them into 'Delta Lakes'," said Ali Ghodsi, cofounder and CEO at Databricks.
Knowing where this is coming from, we had to wonder what exactly this means, and what kinds of data storage Delta Lake supports.
"Delta Lake sits on top of your storage system[s], it does not replace them. Delta Lake is a transactional storage layer that works both on top of HDFS and cloud storage like S3, Azure blob storage. Users can download open-source Delta Lake and use it on-prem with HDFS. Users can read from any storage system that supports Apache Spark's data sources and write to Delta Lake, which stores data in Parquet format," Ghodsi told ZDNet.
Apache Parquet is Databricks' format of choice. Parquet is an open-source columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework. So Delta Lake acts as a layer on top of the supported storage systems, with Parquet as its on-disk data format.
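To make the "layer on top of storage" idea concrete, here is a minimal sketch of the kind of layout this implies: plain data files sitting in ordinary storage, alongside a `_delta_log` directory of ordered JSON commit files that records which files make up the table. The `_delta_log` name and zero-padded version numbers mirror Delta Lake's conventions, but the code itself is an illustration, not the actual implementation:

```python
import json
import os
import tempfile

def commit(table_dir, version, added_files):
    """Record a commit as an ordered JSON file in the transaction log.

    Zero-padded version numbers keep the log files sortable by name;
    the action schema here is simplified for illustration.
    """
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    commit_file = os.path.join(log_dir, f"{version:020d}.json")
    with open(commit_file, "w") as f:
        for path in added_files:
            f.write(json.dumps({"add": {"path": path}}) + "\n")

def current_files(table_dir):
    """Replay the log in version order to find the table's live data files."""
    log_dir = os.path.join(table_dir, "_delta_log")
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files

table = tempfile.mkdtemp()
# In a real table this would be a Parquet file written by Spark;
# an empty placeholder stands in for it here.
open(os.path.join(table, "part-00000.parquet"), "wb").close()
commit(table, 0, ["part-00000.parquet"])
print(current_files(table))  # {'part-00000.parquet'}
```

The point of the layout is that the underlying store only needs to hold files; everything that makes it a "table" lives in the log, which is why the same layer can sit on HDFS, S3, or Azure blob storage.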
Reliability = Transactional support and more
Then, there's the reliability part. In the press release announcing Delta Lake, the wording mentions not just transactions, but also that "users will be able to access earlier versions of their data for audits, rollbacks or reproducing machine learning experiments." So, we wondered how much of that comes out of the box, and what Delta Lake offers exactly -- is it a standard, a tool, or both?
Ghodsi said that Delta Lake offers ACID transactions via optimistic concurrency control between writes, snapshot isolation so that readers don't see garbage data while someone is writing, data versioning and rollback, and schema enforcement to better handle schema changes and deal with data type changes:
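Optimistic concurrency control of this kind can be sketched in a few lines: each writer prepares its changes, then tries to claim the next version number atomically; a writer that loses the race detects the conflict and must retry against the newer table state. This is a simplified, hypothetical model, not Delta Lake's actual commit code:

```python
import os
import tempfile

def try_commit(log_dir, version, payload):
    """Attempt to claim a version by creating its commit file atomically.

    O_CREAT | O_EXCL fails if the file already exists, so exactly one
    concurrent writer can win a given version number -- the loser sees
    the conflict and must re-check and retry at the next version.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another writer committed this version first
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True

log = tempfile.mkdtemp()
print(try_commit(log, 0, '{"add": {"path": "part-00001.parquet"}}'))  # True: first writer wins
print(try_commit(log, 0, '{"add": {"path": "part-00002.parquet"}}'))  # False: must retry at version 1
```

Readers, meanwhile, only ever see fully written commit files, which is what gives them a consistent snapshot while writes are in flight.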
"All of these contribute to adding reliability in data lakes. Data versioning and rollback is something Delta Lake offers out-of-the-box. This capability is completely open source and does not require any specific Databricks integration.
Delta Lake wants to standardize how big data is stored both on-prem and in the cloud. The goal is to make the data lakes ready for analytics and machine learning. To accomplish this goal, Delta Lake provides an open format and a transactional protocol.
As part of the open-source project, we have implemented the format and the protocol for managing transactions; including streaming and batch readers and writers for moving data to and from the Delta Lake."
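Data versioning falls out of a log-structured design like this almost for free: to read the table "as of" version N, a reader replays commits up to N and ignores everything later, and rollback means treating an older version as current. The following self-contained sketch illustrates the idea with a simplified commit format; it is illustrative, not Delta Lake's implementation:

```python
import json

def snapshot(commits, version_as_of):
    """Reconstruct the set of live data files at a given version by
    replaying the ordered commit log up to and including that version.

    `commits` is a list of commit bodies, one JSON action per line --
    a simplified stand-in for files in a transaction log directory.
    """
    files = set()
    for version, body in enumerate(commits):
        if version > version_as_of:
            break
        for line in body.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            if "remove" in action:
                files.discard(action["remove"]["path"])
    return files

commits = [
    json.dumps({"add": {"path": "a.parquet"}}),     # version 0
    json.dumps({"add": {"path": "b.parquet"}}),     # version 1
    json.dumps({"remove": {"path": "a.parquet"}}),  # version 2: a.parquet deleted
]
print(sorted(snapshot(commits, 1)))  # ['a.parquet', 'b.parquet']
print(sorted(snapshot(commits, 2)))  # ['b.parquet']
```

Because old data files are not overwritten, an audit or a machine learning experiment can be reproduced against the exact snapshot it originally ran on.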
Indeed, Delta users have been able to use these features for a while now. Looking back to 2017, when Delta was announced, Ghodsi said they had "basically added transactions and metadata" to Spark in the cloud. The intent to open source parts of this was stated back then, too.
Final destination: Data science and machine learning
"We've believed right from the onset that innovation happens in collaboration -- not isolation. This belief led to the creation of the Spark project and MLflow. Delta Lake will foster a thriving community of developers collaborating to improve data lake reliability and accelerate machine learning initiatives," said Ghodsi.
This technology is deployed in production by organizations such as Viacom, Edmunds, Riot Games, and McGraw Hill. Ghodsi noted that Databricks wants Delta Lake to be the standard for storing big data, and is committed to building a thriving open-source community:
"We've already had a lot of interest from some of our biggest end users who are excited about the prospect of extending the system for their own specific use cases now that it is open source. We want Delta Lake to be the standard for storing big data. To this end, we have decided to open source it for the entire community to benefit. We are working on ways to solve more data quality problems that users face while dumping data into data lakes."
Again, this is consistent with what Zaharia and Ghodsi have previously stated. In about 80% of use cases, per Zaharia, people's end goal is to do data science or machine learning. But to get there, you need a pipeline that can reliably gather data over time: the data engineering has to come first. Ghodsi conceded that Delta Lake won't remove the need for building data pipelines:
"Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level guarantee. Because of these reliability features, it tremendously simplifies the work of building big data pipelines."
This is not the only way to add transactional support to data lakes: Apache Hive offers it too, for HDFS-based storage. But Delta Lake's added value comes from the combination of transactions and a unifying data format. Cloudera's Project Ozone is another effort to unify storage across clouds and on-premises, including transactions, but it's not production ready. The combination of Hive and Ozone could result in something similar to what Delta Lake offers, but it's not quite there yet.
This is a journey -- from data to insights, via data engineering. And from data pipelines to data science and machine learning, via transactions. To be clear, adding transactions to your data lake is not the be-all and end-all: quality data management entails more than this. But Delta Lake should be a welcome addition to the toolbox of anyone building data pipelines on the way to data-driven insights.