Moving fast without breaking data: Governance for managing risk in machine learning and beyond
How do you resolve the tension between the need to build and deploy accurate machine learning models fast, and the need to understand how those models work, what data they touch, and what the implications are? Immuta says data governance is the answer.
Immuta is a startup providing a data management platform for data science. In this GDPR era, when everyone should be conscious of what they do with their data, a data management platform sounds like a good idea. That may help explain why Immuta just raised a $20 million Series B round to help build out its platform.
Data management, however, is a term that may lead many data professionals to think about how data is managed at the physical layer. While that also comes into play, when we talk about managing data at the organizational level, things such as data provenance and access rights are more relevant than, say, storage schemes or indexing strategies.
ZDNet had an exchange with Matthew Carroll, Immuta co-founder and CEO, and finding common ground on terminology was an obvious starting point for discussing Immuta's offering, positioning, and plans.
Moving fast without breaking data governance
"If you asked 10 people to explain data governance, I believe you'd get 10 different answers. We see governance as an enabler for organizations. If done correctly, everyone in an organization can rapidly gain access to data in a compliant way that is recognized and monitored to help ensure success no matter the role (compliance, data scientists, leadership)," says Carroll.
The requirement to address the needs of a wide range of stakeholders is crucial. That's something that has always been with us. Does the advent of data-driven software development make things all that different?
Historically, a few things have gotten in the way of involving all stakeholders in the software development process, which is what underpins and enables digital transformation: lack of understanding and engagement by non-technical stakeholders, lack of communication, and immature software and processes.
What's different now is that in many cases not even the people who develop machine learning models used in applications can explain exactly how they arrive at their results. So how can we expect anyone else to?
It's about things such as clearly documented initial objectives, underlying assumptions, and review boards. Many organizations would shudder at the thought, adopting a "move fast and break things" mentality instead. Innovation, and moving fast, is good, but not at the expense of manageability, ethics, and regulatory compliance.
If organizations are not willing to take that stance of their own accord, then regulation like GDPR is going to make them. Then the question becomes: What can a data governance platform do to facilitate this, so that quality and manageability do not slow things down too much?
A data governance add-on layer
Carroll says they find most governance solutions too focused on the compliance and data privacy side, disregarding downstream users. Their goal, he adds, is to empower everyone equally -- legal, compliance, data owners, and data scientists/analysts -- while making Immuta invisible.
"The data scientist can just keep using SQL or Spark like they always have, but with Immuta, their data appears to be unified. They can self-service request access to data and be granted rapidly, and easily collaborate with others without sharing data exports around. It's a win-win for everyone and an accelerator to all data science objectives -- this is what we believe governance should be," Carroll explains.
That sounds good, but how does it compare to other solutions? One point about Immuta's solution is that it operates as a virtual control plane, connecting to data sources (databases, file systems, etc.) without copying the data.
In doing so, Carroll says, the access interface for BI and data science tools is abstracted:
"Rather than connecting a Looker to each database, a user would just connect to Immuta, and then he/she is able to connect to all of their enterprise data as a series of tables. To the user, we look like a single database.
For users wanting to do bigger workloads, we offer access patterns like HDFS and Spark. For those just writing SQL queries, we offer a SQL data access pattern. And for others, we have a file system. It's all virtual, and on query, we inject rules based on the rights of the user, the controls on the data and the purpose behind the question so as to determine what the data looks like."
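To make the idea of injecting rules at query time more concrete, here is a minimal sketch of how a control plane might rewrite a SQL query based on a user's entitlements before passing it to the underlying database. The function name, the `eu_data` entitlement, and the `region` column are all hypothetical; Immuta's actual policy engine is far more sophisticated.

```python
def apply_row_policy(base_query, user_attrs):
    """Rewrite a query to enforce a (hypothetical) row-level policy.

    If the user lacks the 'eu_data' entitlement, EU rows are filtered
    out before the query ever reaches the backing store.
    """
    if "eu_data" not in user_attrs.get("entitlements", []):
        return base_query + " WHERE region <> 'EU'"
    return base_query

# A user without the entitlement sees a restricted view of the table
restricted = apply_row_policy("SELECT * FROM customers", {"entitlements": []})

# A permissioned user's query passes through unchanged
unrestricted = apply_row_policy(
    "SELECT * FROM customers", {"entitlements": ["eu_data"]}
)
```

The key design point is that the caller keeps writing plain SQL against what looks like a single database; the filtering happens transparently in the access layer.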
Immuta has partnerships with tool providers such as Dataiku, Domino, DataRobot, and Looker that enable it to work like this. For data sources such as file systems or databases, Carroll says they have built connectors that provide a UI workflow allowing any data owner, without having to write code, to expose data within the Immuta Catalog.
That is done by utilizing those technologies' open interfaces, without a partnership in place (Cloudera being the exception), though Carroll notes they would not rule out partnering with a database or cloud storage company.
Some of these back-ends may not have a data governance layer of their own, but others do. This implies that Immuta can also work alongside other solutions, and Carroll confirms this. Immuta can work as a data governance add-on layer, it would seem.
Data across clouds, access policies in natural language
The fact that Immuta does not ingest any data means that it can function as an enabler for hybrid cloud workloads, says Carroll:
"You could expose data in Immuta that resides on-premise today in PostgreSQL. Then tomorrow, when you migrate that data to Redshift, you can re-point Immuta to Redshift and none of the downstream users know that the data moved.
We've seen use cases where customers want to share on-premise data to partners working on the cloud. However, they are worried about sensitive data leaving their infrastructure. You can install Immuta on-premise, and then have the cloud-based analysts query Immuta, thus ensuring data is fully compliant/anonymized as prescribed by the Immuta policies before it leaves your infrastructure headed to the cloud."
This touches upon more aspects of what Immuta does -- access policies and anonymization. Access policies are nothing groundbreaking. In fact, no data governance solution can exist without them. The difference, according to Carroll, is in how Immuta implements these policies:
"Some of our core IP is tied to how we allow users to build policies using natural language. Users with no coding or IT skills can immediately implement rules in Immuta, and I believe more importantly, any other (permissioned) user can understand, in plain English, what exactly the rule is doing.
Under the covers, we're able to translate these rules into SQL, which can be pushed down dynamically to any relational database or applied within SparkSQL at scale, to include very complex controls like differential privacy.
We have the ability to reason at the file-level using metadata; things like who can access what files for what purposes. We can also build what we call global policies, which are semantic rules. Instead of saying I want to mask the literal 'customer last name' column, for example, you could instead build a rule that says, 'mask anywhere there is a last name.'"
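The notion of a "global" semantic policy, which acts on what a column means rather than what it is literally named, can be illustrated with a small sketch. Everything here is hypothetical: the catalog tags, the column names, and the choice of a truncated hash as the masking function are illustrative stand-ins, not Immuta's actual implementation.

```python
import hashlib

# Hypothetical catalog metadata mapping column names to semantic tags.
# Two differently named columns can carry the same meaning.
COLUMN_TAGS = {
    "cust_surname": "last_name",
    "order_total": "currency",
    "family_name": "last_name",
}

def mask(value):
    # Replace a value with a short, irreversible fingerprint
    return hashlib.sha256(value.encode()).hexdigest()[:8]

def apply_global_policy(row, policy_tag="last_name"):
    """Mask every column tagged with `policy_tag`, wherever it appears.

    Note the rule never mentions a literal column name: it says
    'mask anywhere there is a last name', matching on semantics.
    """
    return {
        col: mask(str(val)) if COLUMN_TAGS.get(col) == policy_tag else val
        for col, val in row.items()
    }

row = {"cust_surname": "Doe", "order_total": 42.5, "family_name": "Doe"}
masked = apply_global_policy(row)
```

Because the policy matches on the tag, both `cust_surname` and `family_name` get masked by the single rule, while untagged columns pass through untouched.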
Managing risk in machine learning
Let us get back to our starting point: Machine learning. Immuta has been actively evangelizing the importance of aspects such as privacy, algorithmic bias, transparency, and explainability. But how is what it does relevant for these?
Carroll notes that governing machine learning models and the risks they generate requires a new way of thinking about governance, because of the level of complexity some of these models exhibit:
"There's a reason these models are called 'black boxes' so often. One of the most important factors in assessing and predicting risk relates to the input data. This is exactly where Immuta sits.
Because we act as a single, unified access and control layer for our customers' entire data science environments, we're uniquely positioned to understand what data is going into what models, and to understand when and how that data might exhibit bias or create privacy concerns that could impact these models, among many other risks."
That's not to say that this is easy, or that Immuta has all the answers already, as Carroll acknowledges. The really difficult question, which Immuta is working on, is finding all the ways to ensure customers understand the risks of their data science activities.
"Once someone builds and deploys a model, how can we help them understand if the model suddenly becomes at an increased risk of making biased or incorrect predictions? Take the world of finance, for example.
If a model is trained only on data from an economic boom, how is that model going to behave when the economy dips, and how can we help our customers understand and react to changes like this in real time?"
Carroll referred many times to how Immuta's positioning as a layer streamlining access to data gives it a 360-degree view of all data, enabling it to tackle these hard issues. That raises the question of whether you can trust Immuta with all of your data.
Carroll says Immuta is always transparent to its users, who can maintain full control and visibility into who gets access to that data, and for what purpose. Governance personnel can monitor and set rules and audit actions taken against data, and data consumers can connect to that data even as rules change.
Another reservation would be how fast and seamlessly this can all work, especially against large datasets. Differential privacy is basically about adding noise calibrated to the sensitivity of a query. Since datasets normally do not contain noise, Immuta adds it dynamically at query time. And natural language processing is notoriously tricky, although in a more confined domain such as the one Immuta operates in, it can be easier to handle.
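The query-time noise idea can be sketched in a few lines. A standard differential-privacy mechanism adds Laplace noise scaled by sensitivity/epsilon to an aggregate answer; the sketch below samples that noise via the inverse CDF. This is a textbook illustration of the general technique, not a description of Immuta's implementation, and the function name and defaults are my own.

```python
import math
import random

def noisy_count(true_count, epsilon=1.0, sensitivity=1.0):
    """Return a differentially private count via the Laplace mechanism.

    Noise is drawn from Laplace(0, sensitivity/epsilon) using
    inverse-CDF sampling, then added to the true aggregate at
    query time -- the stored data itself is never modified.
    """
    scale = sensitivity / epsilon
    u = random.random() - 0.5          # uniform in [-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

# Each query gets a fresh noisy answer; individual answers vary,
# but their average converges on the true count.
random.seed(42)
samples = [noisy_count(100) for _ in range(20000)]
mean_answer = sum(samples) / len(samples)
```

A smaller epsilon means more noise and stronger privacy, which is exactly the performance/utility tension the paragraph above alludes to: the noise must be sampled on every query, for every protected aggregate.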
All in all, however, this seems like a promising approach. Carroll says the funding will be invested in growing the team to 100-plus people over the next 12 months. Expansion into Europe is a priority, as GDPR makes this particularly timely.
Immuta wants to invest the brunt of the capital in implementing its go-to-market strategy, growing its sales and marketing teams. It also wants to expand to around 30 to 40 engineers across the board over the year.
Scaling up is always challenging, especially considering Immuta currently has about a third of its stated target headcount. Immuta's technology and positioning are promising nonetheless, and despite the competition and challenges it will be facing, it's worth keeping an eye on.