Data lakes: don't confuse them with data warehouses, warns Gartner

After the hype comes disillusionment - and finally something of value emerges.
Written by Rob O'Neill, Contributor

In mid-2014, a pair of Gartner analysts levelled some trenchant criticisms at the increasingly hyped concept of data lakes.

"The fundamental issue with the data lake is that it makes certain assumptions about the users of information," said Gartner research director Nick Heudecker.

"It assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without 'a priori knowledge' and that they understand the incomplete nature of datasets, regardless of structure."

A year and a half later, Gartner's concerns do not appear to have eased. While there are successful projects, there are also failures -- and the key success factor appears to be a strong understanding of the different roles of a data lake and a data warehouse.

Heudecker said a data lake, often marketed as a means of tackling big data challenges, is a great place to figure out new questions to ask of your data, "provided you have the skills".

"If that's what you want to do, I'm less concerned about a data lake implementation. However, a higher risk scenario is if your intent is to reimplement your data warehousing service level agreements (SLAs) on the data lake."

Heudecker said a data lake is typically optimised for different use cases, levels of concurrency and multi-tenancy.

"In other words, don't use a data lake for data warehousing in anger."

It's perfectly reasonable to need both, he said, because each is optimised for different SLAs, users and skills.

Data lakes are, broadly, enterprise-wide platforms for analysing disparate data sources in their native format, eliminating the cost and transformation complexity of data ingestion. And herein lies the challenge: data lakes lack semantic consistency and governed metadata, putting a great deal of the analytical onus on skilled users.

Heudecker said there is some developing maturity in understanding, but the data lake hype is still rampant.

The maturity of the technology is harder to get a handle on because the technology options to implement data lakes continue to change rapidly.

"For example, Spark is a popular data processing framework and it averages a new release every 43 days," Heudecker said.

The success factors for data lake projects, he said, come down to metadata management, the availability of skills and enforcing the right levels of governance.

"I've spoken with companies that built a data lake, put a bunch of data into it and simply couldn't find anything. Others have no idea which datasets are inaccurate and which are high quality. Like everything else in IT, there is no silver bullet."

Data lakes are an architectural concept, not a specific implementation, he said.

"Like any new concept, or technology for that matter, there will be accompanying hype followed by a period of disillusionment before becoming an understood practice.

"Data lakes will continue to be a reflection of the data scientists that use them.

"The technology may change and improve, perhaps taking advantage of things like GPUs or FPGAs, but the overall intent will be to uncover new uses and opportunities in data. Taking those insights to production will likely occur elsewhere."
