The term "data lake" has been popular for a few years now, particularly in the context of Hadoop-based systems for large-scale data processing. But as Constellation Research VP and principal analyst Doug Henschen notes in an in-depth new report, it's no simple task to create a data lake that lives up to the concept's potential:
The rough idea of the data lake is to serve as the first destination for data in all its forms, including structured transactional records and unstructured and semi-structured data types such as log files, clickstreams, email, images, social streams and text documents. Some label unstructured and semi-structured as "new" data types, but most have been around a long time. We just couldn't afford to retain or analyze this information--until now.
Data lakes can handle all forms of data, including structured data, but they are not a replacement for an enterprise data warehouse that supports predictable production queries and reports against well-structured data. The value in the data lake is in exploring and blending data and using the power of data at scale to find correlations, model behaviors, predict outcomes, make recommendations, and trigger smarter decisions and actions. The key challenge is that a Hadoop deployment does not magically turn into a data lake. As the number of use cases and data diversity increases over time, a data lake can turn into a swamp if you fail to plan and implement a well-ordered data architecture.
It's a mistake to treat a data lake as one monolithic repository, Henschen writes. Rather, a data lake should be divided into "zones" based on the profile of the data each one holds:
If Hadoop-based data lakes are to succeed, you'll need to ingest and retain raw data in a landing zone with enough metadata tagging to know what it is and where it's from. You'll want zones for refined data that has been cleansed and normalized for broad use. You'll want zones for application-specific data that you develop by aggregating, transforming and enriching data from multiple sources. And you'll want zones for data experimentation. Finally, for governance reasons you'll need to be able to track audit trails and data-lineage as required by regulations that apply to your industry and organization.
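To make the zone idea concrete, here is a minimal in-memory sketch in Python. It is not taken from Henschen's report and does not use any Hadoop API. The names `Zone`, `Dataset`, `Catalog`, `ingest`, `derive` and `lineage` are hypothetical, chosen for illustration, and the sample log data is invented. It shows raw data landing with source metadata and a checksum, a refined dataset derived from it, and a lineage and audit record kept along the way.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Zone(Enum):
    """Hypothetical zone names mirroring the ones described above."""
    LANDING = "landing"          # raw data, exactly as ingested
    REFINED = "refined"          # cleansed and normalized for broad use
    APPLICATION = "application"  # aggregated or enriched for one application
    SANDBOX = "sandbox"          # data experimentation


@dataclass(frozen=True)
class Dataset:
    name: str
    zone: Zone
    payload: bytes
    source: str            # where the data came from
    checksum: str          # SHA-256 fingerprint of the payload
    parents: tuple = ()    # paths of upstream datasets (lineage)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def path(self) -> str:
        return f"{self.zone.value}/{self.name}"


class Catalog:
    """Toy metadata catalog: tracks datasets, lineage and an audit trail."""

    def __init__(self):
        self._datasets = {}
        self.audit_log = []

    def _register(self, ds, action):
        self._datasets[ds.path] = ds
        self.audit_log.append((ds.created_at, action, ds.path))
        return ds

    def ingest(self, name, payload, source):
        # Raw data always enters through the landing zone, tagged with its source.
        checksum = hashlib.sha256(payload).hexdigest()
        ds = Dataset(name, Zone.LANDING, payload, source, checksum)
        return self._register(ds, "ingest")

    def derive(self, name, zone, parents, transform):
        # Build a new dataset in another zone and record which datasets fed it.
        payload = transform([p.payload for p in parents])
        checksum = hashlib.sha256(payload).hexdigest()
        ds = Dataset(name, zone, payload, "derived", checksum,
                     parents=tuple(p.path for p in parents))
        return self._register(ds, "derive")

    def lineage(self, ds):
        # Walk back through parents to the raw landing-zone datasets.
        if not ds.parents:
            return [ds]
        roots = []
        for parent_path in ds.parents:
            roots.extend(self.lineage(self._datasets[parent_path]))
        return roots


def normalize(blobs):
    # Example cleansing step: trim whitespace and lowercase every line.
    return b"\n".join(line.strip().lower() for line in blobs[0].splitlines())


catalog = Catalog()
raw = catalog.ingest("clicks.log", b" Home \n CART \n", source="web-frontend")
clean = catalog.derive("clicks", Zone.REFINED, [raw], normalize)
raw_sources = [d.path for d in catalog.lineage(clean)]
```

In a real deployment the catalog would be a persistent metadata service and the payloads would live in distributed storage. The point of the sketch is only that every dataset carries its zone, its origin and a pointer back to the raw data it came from.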
Putting all of this in place is no simple matter, Henschen writes. To do so, enterprises will need a mature set of tools for data ingestion, transformation, cataloging and other tasks.
The problem is that while the Apache Hadoop community has been working on such tools, their unfamiliarity may leave the average enterprise IT data-management professional scratching their head, he adds. The good news is that a broader ecosystem has emerged around Hadoop to tackle the problems associated with managing data lakes.
Henschen's full report takes a much deeper dive into data lake governance and strategy principles. An excerpt is available at this link.