
Superconductive scores $21M Series A funding to sustain growth of its Great Expectations open source framework for data quality

Ensuring data quality is essential for analytics, data science and machine learning. Superconductive's Great Expectations open source framework wants to do for data quality what test-driven development did for software quality
Written by George Anadiotis, Contributor

Technical debt is a well-known concept in software development. It's what happens when unclear or forgotten assumptions are buried inside a complex, interconnected codebase, and it leads to poor software quality. The same thing happens in data pipelines, where it's called pipeline debt, and it's time we did something about it.

That's the gist of what motivated Abe Gong and James Campbell to start Great Expectations in 2018. Great Expectations is an open-source framework that aims to make it easier to test data pipelines, and therefore increase data quality.

Today Superconductive, the force behind Great Expectations, announced it has raised $21 million in Series A funding led by Index Ventures, with CRV and Root Ventures participating. ZDNet caught up with Gong to learn more about Great Expectations.

We need to talk about pipeline debt

The antidote to technical debt is no mystery: automated testing. Testing builds self-awareness by systematically surfacing errors. Virtually all modern software teams rely heavily on automated testing to manage complexity -- except the teams that build data pipelines.

That was the departure point for Gong and Campbell. From there, they set out to develop the Great Expectations open-source framework, based on the core concept of an Expectation.


Data pipelines get more complex over time, and that can lead to data quality issues downstream. Image: Superconductive

Gong and Campbell both come from data science and data engineering backgrounds, so they have ample firsthand experience of how pipeline debt can lead to broken systems or, perhaps even worse, silently failing systems that produce wrong results.

The data needed to populate dashboards, or run machine learning algorithms, typically has to pass through several stages in data pipelines. Data has to be ingested, integrated, cleaned and processed before it can be used.

As data passes through the various stages of a pipeline, a change in the source data can have unintended consequences further down the line. Furthermore, as different teams touch the data, assumptions made by one team are not necessarily known to the others.

The way to fix this, Gong and Campbell suggest, is to introduce tests for data (instead of code), deployed at batch time (instead of compile or deploy time). The name they propose for this is pipeline tests, and this is what Great Expectations does.

An Expectation is simply a statement about data -- something like expect_column_values_to_not_be_null or expect_column_median_to_be_between. On some level, Expectations look like a schema: a mechanism to ensure that data has specific allowed types and values.
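To make this concrete, here is a minimal sketch of declaring Expectations with the framework's pandas-backed API. The file name and column names are hypothetical; the two expectation methods are the ones named above.

import great_expectations as ge

# ge.read_csv wraps pandas.read_csv and returns a dataset that accepts
# Expectation declarations alongside normal DataFrame operations
df = ge.read_csv("events.csv")  # hypothetical input file

# Each call records the Expectation and validates it against this batch
df.expect_column_values_to_not_be_null("user_id")
df.expect_column_median_to_be_between("session_length", min_value=1, max_value=60)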

Gong, however, argues that Expectations go beyond schema in some significant ways. First, Expectations support things most schemas don't, such as checking the distribution of values within a column or looking for statistical relationships between columns.
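As a sketch of such beyond-schema checks -- the column names are again invented for illustration, while both expectation methods come from the framework's built-in library:

# Distributional check: the spread of a column's values should stay
# within an expected band, not merely have the right type
df.expect_column_stdev_to_be_between("session_length", min_value=0.5, max_value=15)

# Cross-column check: a relationship between two columns must hold
df.expect_column_pair_values_A_to_be_greater_than_B(
    column_A="checkout_time", column_B="signup_time"
)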

In addition, Gong likes to think of Expectations as a shared open standard for data quality. The thinking here is that data sources, as well as components in data pipelines, can change over time.

Many data sources, such as relational databases and other data management systems, have schema mechanisms in place. If those data sources change at some point, however, Expectations can serve as a mechanism to retain the data validation that typically comes with a schema.

Great Expectations

Great Expectations integrates with a number of data tools and systems, from MySQL and Postgres to Jupyter Notebooks and Snowflake. So Expectations can be initialized based on existing schemas, or created from scratch.
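As a sketch of what the SQL side of this looks like -- the connection string and table name below are placeholders -- the framework's classic API includes a SQLAlchemy-backed dataset that runs the same Expectations directly against a live table:

from sqlalchemy import create_engine
from great_expectations.dataset import SqlAlchemyDataset

# Placeholder DSN; any SQLAlchemy-supported database works the same way
engine = create_engine("postgresql://user:password@localhost:5432/analytics")

# Expectations are evaluated as SQL against the table, not in memory
orders = SqlAlchemyDataset(table_name="orders", engine=engine)
orders.expect_column_values_to_not_be_null("order_id")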

The flexibility extends to what happens when an Expectation breaks. The response can range from generating an alert to stopping processing altogether. And it's not just about what happens in the data pipeline -- the same mechanism and tools can be used to inspect data at rest.
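What a pipeline does with a failing Expectation is left to the caller. A minimal sketch, assuming a dataset df as in the earlier example (note that the exact shape of the result object has varied across versions):

# Run every Expectation recorded on the dataset against the current batch
results = df.validate()

if not results.success:  # older versions return a dict: results["success"]
    # The reaction is a policy decision: log a warning, page someone,
    # or halt the batch before bad data propagates downstream
    raise ValueError("Pipeline test failed; stopping this batch")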

Another use of Expectations is generating documentation. Gong mentioned that most organizations use some sort of internal wiki, or data catalog, to keep track of their metadata. The problem is, he went on to add, that those are rarely 100% up to date.

The reason is that keeping documentation current is additional work for data teams, and it often gets neglected or forgotten. Great Expectations wants to address that by using Expectations as the source for generating documentation. The idea is that this will ensure documentation stays up to date, as it will reflect the data.
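A sketch of the mechanics: the Expectations accumulated on a dataset can be persisted as a JSON suite, which is the artifact the framework's Data Docs feature renders into human-readable documentation (the file name here is illustrative):

# Persist the Expectations declared so far as a JSON expectation suite;
# Data Docs tooling turns suites like this into browsable HTML docs
df.save_expectation_suite("user_events_suite.json")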


Great Expectations is not just a Charles Dickens book. It's also an open-source framework for data quality. Image: Superconductive

All of that sounds pretty ambitious, but Gong believes they are on the right track to making it happen. Superconductive started out focused on the healthcare domain, but it soon saw plenty of traction across the board, so the team decided to change its business model, focus exclusively on Great Expectations, and take it to market.

Gong said Great Expectations has become one of the fastest-growing data communities in the world, with more than 3,000 members in its community Slack channel, hundreds more joining every month, and monthly downloads approaching the 1 million mark.

A broad variety of companies have integrated Great Expectations into their data analysis and management. This includes tech brands such as Lyft and Snowflake; enterprise brands such as Heineken and McKinsey; and many other companies, including Rent the Runway, Morningstar and Zymergen.

Although Gong committed to keeping the core of Great Expectations open source, he mentioned that a commercial offering is being built, including enterprise features such as advanced tooling, security, SLAs, and support.

"There's no other tool in the Data Ops movement that comes close in terms of adoption for data quality, and funding will certainly help", Gong concluded.
