Informatica's latest acquisition extends machine learning capabilities into matching of data entities and schemas. And the acquisition came out of Informatica's first formal partnership effort with a university. The new capabilities will find their way into Informatica's existing master data management (MDM), enterprise data catalog, privacy, governance, and data integration offerings.
The company, GreenBay Technologies, was co-founded by a University of Wisconsin at Madison computer science professor and began operation with ties to the university and its alumni research foundation. GreenBay and Informatica were hardly strangers, as Informatica was the sole investor in the startup.
GreenBay's CloudMatcher technology applies Random Forest machine learning approaches to several tasks. It matches data entities such as customers or products from multiple structured or unstructured data sets. That can mean the exact data that is in a specific field or extracting data from a block of text. It also performs schema matching, going above the individual data entity to objects or tables where it maps columns representing the same thing. As part of the process, it infers data lineage. And by inferring these schema match relationships, it adds to Informatica's metadata knowledge graph capturing the relationships between data sources.
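To make the idea of matching an entity across a structured field and a block of text concrete, here is a minimal sketch in Python. It is not CloudMatcher's method: the sample records, the support note, the regular expression, and the function names are all hypothetical, and a learned matcher would weigh many more signals than a single email address.

```python
import re

# Hypothetical sample data: one structured source and one block of free text.
crm_records = [
    {"id": 1, "name": "Ada Lovelace", "email": "ada@example.com"},
    {"id": 2, "name": "Alan Turing", "email": "alan@example.com"},
]
support_note = "Customer wrote in from ADA@Example.com about a late shipment."

# Deliberately simple pattern for email-like tokens (an illustrative assumption).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(text):
    """Pull email-like tokens out of free text and lowercase them."""
    return {match.lower() for match in EMAIL_RE.findall(text)}


def match_note_to_records(text, records):
    """Return the ids of records whose email field appears in the text."""
    found = extract_emails(text)
    return [r["id"] for r in records if r["email"].lower() in found]


matched_ids = match_note_to_records(support_note, crm_records)
print(matched_ids)
```

The structured side is compared on an exact field, while the unstructured side first needs an extraction step; that split is the distinction the paragraph above draws.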
It's not that matching tasks haven't been done before, with or without machine learning. For instance, machine learning has been a core pillar of data preparation tools that suggest that specific columns from two different data sets represent the same thing. Likewise, a core pillar of data de-duplication tools is to identify multiple instances of the same entities. And there have been some tools that help autogenerate master data or identify duplication of master data using machine learning.
The prime difference with the GreenBay capabilities, at least compared to data preparation tools, is one of scale; it is designed to handle mapping between thousands of data sets, compared to the handful that most self-service data prep tools handle. The other key differentiators are the ability to handle more diverse data across different domains, including semi-structured and unstructured data, and a crowd-sourcing approach that improves performance.
Schema matching is a much rarer commodity, with Tamr's master data matching likely providing one of the few examples. The challenge is that a model cannot simply look at column names, because different data sets follow different naming conventions. Instead, the task often encompasses taking clues from nearby columns, documentation, data values, and historical query patterns, discovering data relationships, and inferring links between target and source columns.
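The point that column names alone are not enough can be sketched in a few lines of Python. This is an illustrative toy, not GreenBay's or Tamr's approach: it blends just two of the clues mentioned above (column names and data values), and the weights, function names, and sample columns are assumptions made up for the example.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity of two column names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(values_a, values_b):
    """Jaccard overlap of the distinct values found in two columns."""
    set_a, set_b = set(values_a), set(values_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def match_columns(source, target, name_weight=0.3, value_weight=0.7):
    """Map each source column to the best-scoring target column.

    The weights are arbitrary illustrative choices, not tuned values.
    """
    mapping = {}
    for s_name, s_values in source.items():
        def score(t_name):
            return (name_weight * name_similarity(s_name, t_name)
                    + value_weight * value_overlap(s_values, target[t_name]))
        mapping[s_name] = max(target, key=score)
    return mapping


# Hypothetical columns whose names follow different conventions.
source = {
    "cust_nm": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
    "postal": ["53703", "10001", "94105"],
}
target = {
    "full_name": ["Grace Hopper", "Ada Lovelace", "Edsger Dijkstra"],
    "zip_code": ["94105", "53703", "60601"],
}

column_map = match_columns(source, target)
print(column_map)
```

Here the names "postal" and "zip_code" share almost nothing, so it is the overlap in values that carries the match. A production matcher would add the other signals the article lists, such as neighboring columns, documentation, and query history.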
The company explained to us the rationale for its use of Random Forest, a machine learning approach in which multiple decision trees are run and their outputs are then subjected to a crowdsourced consensus process to identify the best results. It is a supervised approach where models are autogenerated after the user applies some declarative rules – that is, he or she labels a sample set of record pairs, and from there the system infers "blocking rules" to build the models. The company has not ruled out future refinements that might borrow techniques that do not rely on labeled examples, such as reinforcement learning, where the system iteratively defines matching logic as it navigates the problem.
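For readers who want a feel for how labeled record pairs can turn into a voting ensemble of trees, the following Python sketch may help. It is a heavily simplified stand-in for Random Forest, not the company's implementation: each "tree" is a one-level decision stump trained on a bootstrap sample with one randomly chosen feature, the consensus step is read here as a plain majority vote (an assumption), blocking is left out entirely, and the similarity scores and labels are invented for illustration.

```python
import random

# Hypothetical labeled record pairs:
# (name_similarity, address_similarity) -> True if the pair is the same entity.
labeled_pairs = [
    ((0.95, 0.90), True),
    ((0.88, 0.85), True),
    ((0.91, 0.70), True),
    ((0.80, 0.95), True),
    ((0.30, 0.20), False),
    ((0.45, 0.10), False),
    ((0.20, 0.40), False),
    ((0.35, 0.30), False),
]


def train_stump(sample, feature):
    """Pick the threshold on one feature that misclassifies the fewest pairs."""
    best_errors, best_threshold = None, None
    for x, _ in sample:
        threshold = x[feature]
        errors = sum((xi[feature] >= threshold) != yi for xi, yi in sample)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return feature, best_threshold


def train_ensemble(pairs, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each with one random feature."""
    rng = random.Random(seed)
    n_features = len(pairs[0][0])
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(pairs) for _ in pairs]  # sample with replacement
        feature = rng.randrange(n_features)
        trees.append(train_stump(sample, feature))
    return trees


def predict(trees, x):
    """Declare a match when more than half of the stumps vote for one."""
    votes = sum(x[feature] >= threshold for feature, threshold in trees)
    return votes * 2 > len(trees)


trees = train_ensemble(labeled_pairs)
print(predict(trees, (0.97, 0.96)))  # a pair that closely resembles the labeled matches
print(predict(trees, (0.10, 0.05)))  # a pair that closely resembles the labeled non-matches
```

The user's only input is the set of labels; the thresholds are inferred from them, which is the supervised pattern described above. A real Random Forest grows deeper trees over many more features, and libraries such as scikit-learn provide tested implementations.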
GreenBay will add to the machine learning capabilities that Informatica already uses, which are loosely branded as the CLAIRE engine. Examples include business rules translation, data relationship inference, data domain inference, operational anomaly detection, mass data correction, and data transformation recommendations, among others. But when it came to entity and schema matching, Informatica relied mostly on rules-based approaches – which are far more time-consuming and harder to scale.
Informatica's plan is to incorporate the GreenBay technologies to add ML to several of its cloud services. Depending on the tool or process, the new capabilities will either guide, supplement, or in some cases replace existing rules generation processes.
The GreenBay technology helps in extending matching beyond identity data by matching product, supplier, location, and other types of data domains. Schema matching will be used to refine generation of rules for data quality. In turn, schema matching and the metadata knowledge graph will be used to sharpen the ability to identify and tag sensitive data for privacy protections; generate inferred lineage for enhancing data cataloging; and provide some baseline capabilities that could eventually autogenerate source and target mappings for data integration.
As noted, Informatica is no stranger to applying machine learning to data integration and governance, but when it came to AI for entity and schema matching, it found itself going back to school.