More than words: Shedding light on the data terminology mess
Data management, data governance, data observability, data fabric, data mesh, DataOps, MLOps, AIOps. It's a data terminology mess out there. Let's try and untangle it, because there's more to words than lingo.
We need XYZ. Definitely. It's in all analyst reports, it's trending off the charts, and our competitors have it, too. So let's find a vendor who's got it, and get ourselves invested. That should do it.
Sound familiar? Hopefully, technology investment decisions in your company are not made this way. But as technology is evolving faster than ever, it's hard to keep up with all the terminology. Unfortunately, some people see terminology as an obfuscation layer meant to glorify the ones who come up with it, hype products, and make people who throw terms around appear smart.
There may be some truth in this, but that does not mean terminology is useless. On the contrary, terminology is there to address a real need, which is to describe emerging concepts in a fast-moving domain. Ideally, a shared vocabulary should facilitate understanding of different concepts, market segments, and products.
Case in point: data and metadata management. Have you heard the terms data management, data observability, data fabric, data mesh, DataOps, MLOps and AIOps before? But do you know what each of them means, exactly, and how they are all related? Here's your chance to find out.
Besides being officially proclaimed cool, there's another reason why Soda's founders might know a thing or two: they've been around. Masschelein was employee number five at Collibra, which was, in his words, the first company selling software to Chief Data Officers -- before that was even a thing. Baeyens was founder and project lead at jBPM, a legendary business process management (BPM) open-source project.
Let's start with data fabric. Masschelein sees this as a framework for organizing data for scale -- a meta-layer for accessing all data relevant to an organization, wherever they may reside, in a unified way.
A data fabric focuses on the technology aspect of this unified access to data.
Data mesh is a similar concept but different in the sense that it focuses on organizational aspects. Masschelein finds that data mesh is akin to a modernized version of data governance principles, applicable for broader data teams. The goal is to structure and organize, removing some of the past bottlenecks, such as a reliance on a data warehouse team. Masschelein said:
"With data mesh, it's fundamentally about building data products and data services. So it's data product thinking. In data governance, we talk about managing data as an asset. When we talk about managing data as a product, this is more specific, ultimately. It's this notion that we should have core platform services. But then, on top of that, we need to have structure around data domains, areas, business, expertise, and knowledge, enabling them to be self-served. I think that's the key".
Data management, Masschelein went on to add, is a term that has existed for many decades already. It has been extensively described by the Data Management Association (DAMA), which has done a lot of work around how data should be managed. Ultimately, a part of that was metadata management, which gave rise to data cataloging software and data lineage capabilities.
Masschelein sees data monitoring, data observability, and data testing as specialized subdomains of quality management within the broader data management framework. Baeyens added context on data observability:
"You have engineers building data pipelines. They prepare data to be used in data products, such as machine learning models. There are a bunch of engineers developing new products regularly. Once those products get into production, that's where the observability starts. That's where the data could actually go bad. If the models using the data don't notice that the data is bad, this leads to all sorts of very costly and dangerous consequences".
Data monitoring, testing, fitness, and collaboration
As for DataOps, it's about organizing data-related capabilities into best-practice processes that deliver data products with increasing velocity and reliability. Many small processes need to be put in place and standardized to enable working better with data, similar to what we've done with DevOps in software engineering, said Masschelein.
MLOps, which seems to be used interchangeably with AIOps, relies on a good DataOps foundation but is more specialized. In DataOps, we won't be monitoring prediction accuracy, for example. That is specific to the data product and also specific to the lifecycle of the data product. Masschelein thinks about it from a lifecycle perspective:
"Those are two separate things because the life cycle of a dataset is not tightly coupled to the lifecycle of machine learning or a data product, ultimately. There are also different people doing that. When it comes to managing data and DataOps, we have data producers which can be external to the organization, and you have internally generated data.
"Another way of looking at it is the tooling landscape. And if you look at the monitoring and observability software stack, we have infrastructure at the bottom. So first, we write applications, and then nowadays we use data and machine learning as two kinds of new layers".
We're just getting started with software and platforms to help monitor these relatively new layers, whereas the other ones have existed for much longer, the duo notes. And this is where Soda's own platform comes into play. The name came about because the founders liked the idea of silent data issues bubbling up, like fizzy soda. Soda covers monitoring, testing, data fitness, and collaboration.
Monitoring is about automatically checking datasets for issues. That means trying to figure out if there's something abnormal about the datasets that land in your environments. For example, approximately how many records did you process this time around? Is that abnormal compared to what there was on the same day last week? Soda can use machine learning to spot such anomalies.
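To make the row-count question concrete, here is a minimal sketch in Python. It is not Soda's implementation, which the founders did not detail. It swaps in a plain z-score check for the machine learning approach, and the function name, threshold, and sample counts are all assumptions made up for illustration.

```python
from statistics import mean, stdev


def is_row_count_anomalous(history, current, threshold=3.0):
    """Flag `current` if it sits more than `threshold` standard
    deviations away from the mean of `history` (a simple z-score check).

    `history` is a list of row counts from comparable past runs,
    e.g. the same weekday in previous weeks (hypothetical setup).
    """
    if len(history) < 2:
        return False  # not enough history to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past counts were identical: any change is a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical row counts for the same weekday over previous weeks.
same_weekday_counts = [10_120, 9_980, 10_050, 10_210, 9_940]

normal_run = is_row_count_anomalous(same_weekday_counts, 10_000)
broken_run = is_row_count_anomalous(same_weekday_counts, 4_000)
```

With these made-up numbers, a run of 10,000 records passes quietly, while a run of 4,000 records is flagged. A real system would also need to account for trends and seasonality, which is presumably where machine learning earns its keep.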
But monitoring only covers a small percentage of the types of data issues you can have. That's why data testing and validation is the next step. This is where both data engineers and subject matter experts can specify rules such as "We can only have X percent of missing data in this column," "We need referential integrity," or "This column can only contain an allowable set of values."
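The three kinds of rules mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `orders` and `customers` tables, and their columns are hypothetical, and Soda itself expresses such checks in its own tooling rather than in this form.

```python
def missing_percent(rows, column):
    """Percentage of rows where `column` is None or absent."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return 100.0 * missing / len(rows)


def invalid_values(rows, column, allowed):
    """Non-missing values in `column` that fall outside the allowed set."""
    return [
        row[column]
        for row in rows
        if row.get(column) is not None and row[column] not in allowed
    ]


def orphaned_keys(child_rows, fk_column, parent_rows, pk_column):
    """Foreign-key values with no matching parent row (referential integrity)."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [
        row[fk_column]
        for row in child_rows
        if row.get(fk_column) is not None and row[fk_column] not in parent_keys
    ]


# Hypothetical sample data.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "status": "shipped"},
    {"order_id": 11, "customer_id": 2, "status": None},
    {"order_id": 12, "customer_id": 3, "status": "lost"},
    {"order_id": 13, "customer_id": 1, "status": "pending"},
]

status_missing = missing_percent(orders, "status")
bad_statuses = invalid_values(orders, "status", {"pending", "shipped", "delivered"})
orphans = orphaned_keys(orders, "customer_id", customers, "id")
```

Here one of four orders has no status, one carries a status outside the allowed set, and one points at a customer that does not exist. Each result can then be compared against whatever threshold the data owner has agreed to, such as "at most X percent missing."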
That's all fine and well, but if you have a system for the discovery of data issues, it will create a lot of alerts, so the question is: How do you handle the alerts? What is the business process that you go through? This is where data fitness dashboards come in. This enables SLA tracking, giving data owners a view of all the expectations on data across the organization and a workflow around the resolution of issues.
Last but not least, collaboration is a cross-cutting concern. Having collaboration features enables people with different knowledge about the problem, who often have tacit, undocumented knowledge, to work together and resolve issues. Baeyens mentioned that this also touches upon features not traditionally thought of as collaboration, such as enabling analysts to manage domain knowledge themselves without the involvement of data engineers.
Suds and Soda
The expertise in BPM that Baeyens brings to Soda has been leveraged in building the platform, specifically in how the different modules fit together in a workflow progression. Soda works with SQL sources, and Spark integration is almost there. The goal is to be able to cover as much of the data landscape as possible.
Soda may not cover all the key pillars of a comprehensive data fabric as per the Gartner definition, but then again, it's hard to think of many solutions that do. It does, however, augment data catalogs, focusing on DataOps. In addition, Soda targets different user segments, and that is also reflected in its offering.
There is an open-source layer aimed at data engineers. Baeyens believes that this user segment is not necessarily interested in a SaaS offering. Open source Soda SQL aims to be simple and work with technology its target audience likes to use -- SQL and YAML, as per Baeyens.
Soda SQL is seeing good growth and adoption, and it's a way for people to get to know Soda. However, if they like what they see and their needs grow to include people such as analysts and CDOs, then it's time to move to the paid, SaaS version of Soda.
The company recently received €11.5 million in Series A funding, which, combined with its previous seed funding, gives a total of about €14 million. That should provide Soda with a good runway to develop its offering, with the aim of growing both the engineering and the go-to-market teams.
Soda's founders seem to have a firm grasp of the landscape they operate in, if nothing else.