Keeping track of all your data -- where it's been, where it's going, who accesses it, and what they do with it -- is neither fun nor exciting. But it is a necessary substrate for holistic data management, and in the age of GDPR and CCPA, it's also a legal requirement. This is what data governance is about.
Data catalogs are the unsung heroes of data governance. A data catalog is loosely defined as a metadata management tool designed to help organizations find and manage large amounts of data. Today, one of the key players in the data catalog space, Waterline Data, is announcing updates to its product, and ZDNet took the opportunity to discuss them with founder and CTO Alex Gorelik.
Waterline Data catalog gets an update: DataOps dashboard and hybrid multi-cloud
Waterline Data is a single-product company. Every solution it offers is built on its data catalog, from Metadata Management and Data Lineage to Sensitive Data Discovery and Data Rationalization. Today's release is centered around a new DataOps Dashboard, which Waterline says can serve as a regulatory hub where companies can understand the macro risk of their data estate.
The DataOps Dashboard allows users to easily locate and view specific files that contain regulated sensitive data, and helps expedite the identification, remediation, and documentation processes needed to meet GDPR and CCPA requirements. Gorelik, however, pointed out that there is another big improvement: a new agent architecture that enables hybrid multi-cloud support.
"Waterline can now catalog and automatically tag data in multiple clouds like AWS, Azure and Google Cloud Platform; on-premise big data systems like Cloudera and MapR; cloud databases like Snowflake and RedShift; and on-premise relational databases. The agents can run natively on Apache Spark or in a container for environments that do not have a Spark cluster," says Gorelik.
Another new feature is support for data residency laws that restrict sending data out of the country. An agent can be configured to do all processing and discovery locally, and only send non-sensitive metadata to the central catalog. Finally, there are improvements around usability, personalization, and collaboration.
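To make the data residency idea concrete, here is a minimal Python sketch of an agent that profiles data locally and forwards only an allowlisted subset of metadata. It is not Waterline's implementation: the function name `to_central_payload`, the `SAFE_KEYS` allowlist, and every field name in the profile are hypothetical and chosen only for illustration.

```python
# Hypothetical allowlist of metadata keys considered safe to leave the country.
SAFE_KEYS = {"dataset", "field", "data_type", "row_count", "null_ratio", "tags"}


def to_central_payload(local_profile):
    """Keep only allowlisted keys; anything else (such as sample values) stays local."""
    return {key: value for key, value in local_profile.items() if key in SAFE_KEYS}


# Example profile computed by a local agent (all values are made up).
profile = {
    "dataset": "crm/customers.csv",
    "field": "email",
    "data_type": "string",
    "row_count": 1200,
    "null_ratio": 0.02,
    "tags": ["email_address"],
    "sample_values": ["alice@example.com", "bob@example.com"],  # sensitive, never sent
}
payload = to_central_payload(profile)
```

An allowlist is used rather than a blocklist so that any newly added profile attribute stays local by default until someone explicitly marks it as safe to send.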
Integrations and open source
Metadata really is the key here, and Waterline complements it with machine learning to automate as much of the drudgery as possible. This was the focal point of our discussion with Gorelik, starting with the exact nature of the metadata managed, as well as the integration with other systems that Waterline refers to.
Gorelik says that for relational databases, Waterline normally uses standard JDBC. Sometimes, however, platform-specific work is needed. Waterline automatically recognizes file formats and parses files (Avro, Parquet, JSON, XML, ORC, CSV, etc.) in file systems and object stores. Crawling is done automatically and incrementally: point Waterline to a folder or a database and it detects any changes and processes new data.
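As a rough illustration of what incremental crawling means, the Python sketch below walks a folder, guesses each file's format from its extension, and reports only files that are new or changed since the previous run. This is an assumption-laden toy, not how Waterline works: the names `FORMATS`, `detect_format`, and `crawl` are hypothetical, change detection is based on modification time and size only, and a real catalog would inspect file contents rather than trust extensions.

```python
import os

# Hypothetical mapping from file extension to a format label.
FORMATS = {
    ".avro": "Avro", ".parquet": "Parquet", ".json": "JSON",
    ".xml": "XML", ".orc": "ORC", ".csv": "CSV",
}


def detect_format(path):
    """Guess the file format from the extension (a simplification)."""
    extension = os.path.splitext(path)[1].lower()
    return FORMATS.get(extension, "unknown")


def crawl(root, seen):
    """Return (path, format) pairs for files that are new or changed.

    `seen` maps path -> (mtime_ns, size) from earlier crawls and is
    updated in place, so repeated calls only report the differences.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            signature = (stat.st_mtime_ns, stat.st_size)
            if seen.get(path) != signature:
                seen[path] = signature
                changed.append((path, detect_format(path)))
    return changed
```

Calling `crawl` twice with the same `seen` dictionary and no file changes in between returns an empty list on the second call, which is the "incremental" part: only the delta would be handed on for parsing and tagging.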
Integration is done via REST APIs, which support two-way integration. Gorelik mentioned Waterline offers pre-built adapters that import lineage from Atlas and Cloudera Navigator, and export tags and tag associations to Atlas and Cloudera Navigator, where these tags are used to drive Ranger and Cloudera Sentry tag-based access control policies.
These REST APIs have their own JSON data definitions, but what we were really hoping to hear there was some kind of support for Egeria. Egeria is an ODPi open-source project which implements a set of open APIs, types and interchange protocols to allow all metadata repositories to share and exchange metadata.
Hortonworks was an ODPi member, Egeria was featured in Hortonworks' DataWorks event in 2018, and it seemed like this was the way forward for metadata management in the Hadoop world as far as Hortonworks was concerned. Apparently the Cloudera-Hortonworks merger has complicated things, as nowadays it's all about Cloudera Navigator for metadata management. However, Egeria was featured in the new Cloudera DataWorks event in 2019, too, so there may be hope still. Leveraging Egeria would be a good idea.
Egeria is looking at integrating metadata vocabularies and standards. An open-source effort would ensure interoperability and be beneficial for users and vendors. The new Cloudera has committed to a 100% open-source strategy, and there is a Cloudera partnership with IBM, a key ODPi member and Egeria contributor. As John Mertic, director of program management for The Linux Foundation, said in his Egeria presentation, "Ask your data management vendor for Egeria support -- ING does."
This is further supported by the fact that Gorelik notes Waterline usually goes with best-of-breed open-source projects. Currently, metadata is stored in SOLR for quick search access and in Postgres for dashboarding and analytics: "Since SOLR ships with most Hadoop distributions and provides a number of improvements over Lucene, it was a good choice for us. Postgres is free and very common."
GDPR, CCPA? There's machine learning for that, too
Metadata is great and all, but the problem is that not all data has it. Providing quality metadata takes time and resources, and frankly, it's not all that exciting. But as Gorelik says, GDPR was a wake-up call for many companies:
"Many of our customers have billions (with a B) of fields of data. People always knew how little was documented and known about their data. GDPR forced an uncomfortable discussion at the C level about the fact that, 'No, we really do not know where all of our customer data is.'
This in turn has led to companies investing in cataloging the data either manually through surveys and attestations or in an automated way using tools like Waterline Data. At one point, companies thought that they can catch the data at the point of exit -- i.e., check the black list before sending out a marketing email.
Companies soon realized that if a data set is compromised by hackers, they still have to notify consumers that their data was breached even after they asked to be forgotten, so they started focusing more on finding and managing data across the data estate."
Similarly, Gorelik notes, Brexit caused a lot of UK and multinational companies to draw up contingency plans, including forming new subsidiaries to maintain an EU presence. In this process, many realized that they did not have a clear handle on the data they needed to draw up those plans, or on the data that would need to be separated in that eventuality.
Just like with GDPR, Gorelik went on to add, CCPA covers all data about the customers, not just Personally Identifiable Information (PII). And, as was the case with GDPR, it is causing affected companies to have uncomfortable discussions about not knowing where all of their data resides.
Waterline is trying to ease the burden of managing metadata by leveraging Aristotle, its machine learning system for filling in missing metadata. Aristotle leverages patented fingerprinting technology to automate the discovery, classification, management, and governance of this enormous amount of now-regulated sensitive data scattered across the enterprise.
As Gorelik explained:
"Fingerprint works across three dimensions: 1. content (the actual values and their characteristics) 2. metadata (names, comments, etc.) and 3. context (for example, a field containing numbers between one and six digits and no NULLs in a record with street names, city names and zip codes is very likely a house number; a record without any other address components is very unlikely to be a house number).
Or, to put it another way, the system isn't looking for additional metadata as much as it is automatically filling in extra details of each 'fingerprint' using metadata, data, and context together. All the previous outcomes -- someone tagging a field with a tag, accepting a suggested tag, and rejecting a suggested tag -- are used to calculate a confidence level that a certain field gets a certain tag."
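The house-number example Gorelik gives can be turned into a small Python sketch that scores a candidate tag from content, metadata, and context, and then shifts that score using past accept and reject decisions. None of this is Waterline's patented fingerprinting: the scoring functions, the equal weighting of the three dimensions, the `PRIOR_STRENGTH` pseudo-count, and all field and tag names are assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical pseudo-count: how many votes the automatic signal is worth.
PRIOR_STRENGTH = 2


@dataclass
class Feedback:
    """Past user decisions on the suggestion 'this field is a house number'."""
    accepted: int = 0
    rejected: int = 0


def content_score(values):
    """Content: share of values that are non-null and one to six digits long."""
    if not values:
        return 0.0
    hits = sum(
        1 for value in values
        if value is not None and re.fullmatch(r"\d{1,6}", str(value))
    )
    return hits / len(values)


def metadata_score(field_name):
    """Metadata: does the field name hint at a house number?"""
    hints = ("house", "street_no", "bldg")
    return 1.0 if any(hint in field_name.lower() for hint in hints) else 0.0


def context_score(sibling_tags):
    """Context: share of other address components tagged in the same record."""
    address_tags = {"street_name", "city", "zip_code"}
    return len(address_tags & set(sibling_tags)) / len(address_tags)


def confidence(values, field_name, sibling_tags, feedback):
    """Blend the three signals, then let accept/reject history move the result."""
    signal = (
        content_score(values)
        + metadata_score(field_name)
        + context_score(sibling_tags)
    ) / 3
    votes = feedback.accepted + feedback.rejected
    return (feedback.accepted + signal * PRIOR_STRENGTH) / (votes + PRIOR_STRENGTH)
```

In this sketch the automatic signal acts as a prior worth two votes, so with no feedback the confidence equals the blended signal, and each acceptance or rejection pulls it toward 1 or 0. A numeric column with no address fields around it starts with a low score, matching the point in the quote that such a field is unlikely to be a house number.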
Waterline offers what seems like a pragmatic and advanced approach to data catalogs and metadata management. As there are many approaches and solutions in this space, however, interoperability is key, so we hope we can see better support for this among various data sources and solutions in the future.