Why AI and machine learning are driving data lakes to data hubs

Data lakes were built for big data and batch processing, but AI and machine learning models need more flow and third-party connections. Enter the data hub, a concept that is likely to pick up steam.

The data lake was a critical concept for companies looking to put information in one place and then tap it for business intelligence, analytics and big data. But the promise never quite played out. Enter the data hub concept, which is starting to become a rallying point for technology vendors as enterprises realize they have to connect to more than their own data to enable their algorithms.

Pure Storage last month outlined its data hub architecture in a bid to ditch data silos and enable more artificial intelligence, machine learning and cloud applications. On Oct. 9, MarkLogic, an enterprise NoSQL database provider, launched its Data Hub Service to offer better curated data for Internet of Things, AI and machine learning workloads. MarkLogic claimed that its Data Hub Service is actually "data lakes done right."

Meanwhile, SAP also has a data hub that's focused on moving data around. And you could argue that the $5.2 billion merger of Cloudera and Hortonworks will put the combined company on a path to be a broad enterprise platform that will eventually have data hub features.

Rest assured that enterprise technology vendors will be talking about the "data hub" a lot more. The term may also be in the running for 2019's buzzword of the year.

So what's driving this data hub buzz? AI and machine learning workloads. Simply put, the data lake is more like a concept designed for big data. You can analyze the lake, but you may not find all the signals needed to learn over time.

Jeremy Barnes, chief architect of Element AI, said "the data lake is not dead from our perspective." But the data lake model "doesn't take into account AI and the ability to learn. It needs to adapt to something that enables intelligence systems to evolve," he said.

Element AI's mission is to turn research into products for businesses. Based in Montreal, the company draws on its own research as well as a network of academics to help clients develop their AI strategy.

To Barnes, the data lake model is built on the idea that data needs to be in one place and accessible. The issue is that AI is less about the data itself and more about the signal in it. "The data lake doesn't match the reality of bringing AI into processes," said Barnes.

As a result, Element AI worked with Pure Storage to create a data hub architecture. Pure Storage recently rolled out its Data Hub architecture to account for the reality that data has to be connected via API to outside partners. Data lakes are more internally focused and lack the flexibility to account for the entire data sharing cycle.

For Element AI, moving to a data hub setup was more about time to market. "We started working with Pure a year ago because we had performance issues with the architecture we have put in place," said Barnes. The architecture needs to be flexible, with a software layer controlled by data, he added.

Moor Insights & Strategy said in a research note that Pure Storage's Data Hub concept is about melding data and storage architecture. A data hub requires multiple parallel compute and storage elements that can be partitioned and tuned for workloads. Software orchestration then gets data to applications as it is needed.

[Figure: data hub architecture]

Source: Pure Storage, Moor Insights & Strategy
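
To make the orchestration idea concrete, here is a minimal Python sketch of storage partitions tuned for different workloads, with a software layer that hands data to an application only when it asks. This is an illustration under stated assumptions, not Pure Storage's implementation: the `StoragePool` and `DataHub` classes, the pool names and the workload labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoragePool:
    """Hypothetical storage partition tuned for one kind of workload."""
    name: str
    tuned_for: str  # illustrative labels such as "batch" or "training"
    datasets: dict = field(default_factory=dict)


class DataHub:
    """Toy orchestration layer that routes requests to the matching pool."""

    def __init__(self, pools):
        # Index pools by the workload each one is tuned for.
        self._pools = {pool.tuned_for: pool for pool in pools}

    def publish(self, workload, key, records):
        """Store records in the pool partitioned for this workload."""
        self._pools[workload].datasets[key] = list(records)

    def fetch(self, workload, key):
        """Deliver a dataset to an application only when it is requested."""
        pool = self._pools.get(workload)
        if pool is None or key not in pool.datasets:
            raise KeyError(f"no dataset {key!r} for workload {workload!r}")
        return pool.name, pool.datasets[key]


hub = DataHub([
    StoragePool(name="pool-a", tuned_for="batch"),
    StoragePool(name="pool-b", tuned_for="training"),
])
hub.publish("training", "clickstream", [{"user": 1, "clicked": True}])
pool_name, rows = hub.fetch("training", "clickstream")
print(pool_name, len(rows))  # prints: pool-b 1
```

A real data hub would add the pieces this sketch leaves out, such as parallel compute, external API access for partners and real-time feeds. The point here is only the routing: applications name a workload and a dataset, and the software layer decides which partition serves it.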


Data lakes will remain part of the architecture, but sharing, real-time processing, on-demand infrastructure and virtualization require more of a data hub concept to produce better models and insights. The data hub bandwagon is nowhere close to full, but here's a bet that it'll get crowded relatively quickly.