Tackling what we used to call big data has gotten beyond religious open source project debates. Instead, the challenge is breaking down the silos between data scientists and data engineers. This year’s Strata Data conference marks the end of the era where the focus was mostly on the platform.
Five years ago, Mike Olson, then of Cloudera, addressed the Strata Data conference, then in its first year in expanded quarters at the Javits Center in New York, and called for Hadoop to disappear. Hadoop started disappearing from the title of the Strata conference two years ago, and to judge by the list of new technologies that are changing this platform, like Kubernetes and cloud storage, the circle is nearly complete. Hadoop, we hardly know you.
During this year's keynote, James Malone, a senior manager at Google Cloud, demonstrated how, at the push of a button, customers could choose to deploy on Cloud Dataproc with YARN or Kubernetes. His demonstration showed, first, how the architecture of Hadoop is changing, and second, the advantages that Hadoop as a cloud-managed service could provide. As Andrew reported last week, in its latest release, Cloudera has separated compute from storage, a key prerequisite to taking advantage of the resource scaling flexibility that Kubernetes enables.
Although we didn't get attendance figures, exhibitors told us anecdotally that attendance looked lighter than during the euphoria of several years back, but that the people who write checks did show up. The audience hasn't gone down, but it's dispersed. The sponsor makeup has clearly changed over the years.
As Big on Data bro Andrew Brust pointed out in his piece, there were few if any game-changing headlines coming out. Yes, there are some minor trends, such as the emergence of new streaming alternatives like Flink, and pubsub systems like Pulsar. It's not time to close the patent office on new open source projects just yet. But Hadoop zoo animals and the popularity battles of open source projects are no longer the headlines; there are more important things at stake.
Brust ventures that maybe it's a sign that the industry is maturing, moving on from "obsession with mere capabilities" toward successful implementations and references. That explains, for instance, the prominent role that data catalogs are playing across the industry. If you want to analyze data, you first should know what's in there.
And maturation also explains the fact that, after meeting with a couple dozen vendors over the two days, the term we heard most often was data governance: vouching for the quality of the data, and ensuring that the handling of and access to it consistently meet internal policies or external regulatory mandates. Given the growing incidence of data breaches and the emergence of new privacy laws such as GDPR, the concern is not coming a moment too soon.
Data governance is hardly a new idea, obviously, but extending governance to heterogeneous data that often comes from sources outside the organization has been patchwork. So, for instance, we saw data integration provider Infoworks speak of its data catalog piece as being the cornerstone for applying governance. And we met with a startup, Privacera, founded by the creator of the Apache Ranger and Atlas projects, that seeks to extend those data auditing and tagging capabilities beyond Hadoop to other data platforms. Those are just a few examples. And we even started hearing rumblings about extending governance to machine learning models, but we've yet to see any products supporting that. Watch this space.
Strata, and what we used to call "big data," have hit midlife, and as such, are facing a midcourse correction going forward. The event was originally cast as the Hadoop conference when Hadoop was the shiny new thing, and at the time, the only way to efficiently analyze terabytes or petabytes of heterogeneous data. While the event dropped "Hadoop" from its name a couple years back, the conference continued to be associated with it as it's still Cloudera's officially sponsored annual user event.
The world of big data has moved on. Big data no longer equals Hadoop. Analyzing big data is no longer exceptional because the ability is not restricted to those with the sacred knowledge of setting up Hadoop clusters. It's no longer big data; it's just data, because there are many ways you can get to it. For starters, as we noted a couple years back, cloud storage, not Hadoop, has become the de facto data lake. Just stream lots of data into S3, and Presto, you have your data lake. Take your relational database and use federated query to reach out to cloud storage; if your data warehouse doesn't have it today, it will have that capability soon. You might write that query in SQL or push Python code down to run inside the database.
If you're a data scientist, code your model in a Jupyter notebook, run a tool that can take that notebook and marshal up some nodes in the cloud to deploy, or take advantage of a Spark or AutoML point service to train and run your models. But the danger, with data scientists, is that they have become another silo, working in isolation on their laptops. And then, even if they use a tool that automates deployment out on clusters or in the cloud, those tools still lack features that introspect data and use machine learning to optimize or recommend refactoring of algorithms.
The problem is the gulf between data scientists, who focus on models but reluctantly spend the bulk of their time wrangling data, and data engineers and architects, who know how the data is laid out and how best to distribute the code that runs against it. There's even a new cross-discipline emerging, MLOps, that bridges the two.
Until now, Strata reinforced those silos. Although the event had plenty of machine learning content, it was largely aimed at data engineers and architects. This separation has been growing more artificial, because increasingly, data, whether in on-premises clusters or in the cloud, is being used by data scientists for machine learning workloads. Platform or cloud services decisions must factor in how the data is used, and algorithms must account for how the data is laid out and how it should be transformed. And decisions on the use of data must factor in lineage, and with that, policies on whether or how that data can be accessed or used. The good news is that next year, O'Reilly, the conference organizer, is bringing its AI conference into the fold so data engineers and data scientists might rub elbows. It'll certainly liven up the place.