Data governance -- the discipline of inventorying and annotating your data sets, determining their accuracy, pedigree and quality and properly securing them -- is an important focus area for the industry. In the conventional database world, Enterprise Information Management (including ETL, data quality management and master data management) has addressed these needs for some time. In the data lake world, though, efforts have been far less earnest.
Granted, there are data catalog products, and lineage products. There are various security/access control solutions and there are metadata management systems as well. Cloudera has its Navigator product, and there's even open source Apache Atlas (incubating), born of a Hortonworks project called the "Data Governance Initiative." Some analytics products even have governance features of their own, enticing customers away from bringing in another vendor and platform to handle governance requirements.
On Tuesday, MapR announced a data governance initiative of its own. It comprises an interesting architectural approach, a few key partnerships, and a service offering to go with it all. I'll first detail MapR's announced offering, and I'll conclude with some analysis on the state of data governance in the Big Data world.
On the technical side, MapR has come up with an approach that, to me at least, is pretty novel and clever. The company has taken a prescriptive stance here and is advising customers that all data ingestion should go through MapR Event Streams (MapR-ES, formerly known just as MapR Streams), the company's Kafka API-based publish/subscribe platform for handling event-based data ingest.
The hook, as it were, is this: by configuring a pre-processor on the MapR-ES topic, all data pushed through it can be observed, its discernible metadata captured in a MapR-DB document database, and metadata changes recorded there as well. This allows for metadata cataloging. And if derivative data set creation is managed similarly, and all MapR-ES events are retained, data lineage can be determined comprehensively, just by "playing back" the events.
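To make the idea concrete, here is a minimal, purely illustrative sketch in Python. It does not use any MapR API: the GovernedTopic class, the Event type and the derived_from field are all hypothetical names, an in-memory list stands in for the retained MapR-ES event log, and a plain dictionary stands in for the MapR-DB metadata store. It only shows the general pattern of observing each event on the way in, recording schema changes, and reconstructing lineage by replaying the log.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Event:
    """Hypothetical event shape; not a MapR-ES message format."""
    dataset: str                        # data set the record belongs to
    record: dict[str, Any]              # the payload itself
    derived_from: tuple[str, ...] = ()  # parent data sets, if any


@dataclass
class GovernedTopic:
    """In-memory stand-in for a topic with a metadata pre-processor."""
    log: list[Event] = field(default_factory=list)          # retained events
    catalog: dict[str, dict] = field(default_factory=dict)  # metadata store stand-in

    def publish(self, event: Event) -> None:
        self._preprocess(event)  # observe the event before retaining it
        self.log.append(event)

    def _preprocess(self, event: Event) -> None:
        # Capture the discernible metadata: here, just field names and types.
        schema = {k: type(v).__name__ for k, v in event.record.items()}
        entry = self.catalog.setdefault(
            event.dataset, {"schema": {}, "changes": []}
        )
        if entry["schema"] != schema:
            # Record the metadata change alongside the current schema.
            entry["changes"].append({"from": entry["schema"], "to": schema})
            entry["schema"] = schema

    def lineage(self, dataset: str) -> set[str]:
        """Replay the retained log to find every ancestor of `dataset`."""
        parents: dict[str, set[str]] = {}
        for event in self.log:
            parents.setdefault(event.dataset, set()).update(event.derived_from)
        ancestors: set[str] = set()
        stack = [dataset]
        while stack:
            for parent in parents.get(stack.pop(), ()):
                if parent not in ancestors:
                    ancestors.add(parent)
                    stack.append(parent)
        return ancestors


topic = GovernedTopic()
topic.publish(Event("orders_raw", {"id": 1, "amount": 9.5}))
topic.publish(Event("orders_clean",
                    {"id": 1, "amount": 9.5, "currency": "USD"},
                    derived_from=("orders_raw",)))
topic.publish(Event("revenue_daily",
                    {"day": "2018-01-01", "total": 9.5},
                    derived_from=("orders_clean",)))
```

In this toy setup, calling topic.lineage("revenue_daily") walks the replayed log back through orders_clean to orders_raw. The point it illustrates is the one above: lineage is only comprehensive if every write, including derivative data sets, actually passes through the observed topic and the events are retained.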
The partner part
So MapR provides the raw infrastructure to get metadata and lineage information. But it doesn't offer a data catalog facility that would let data lake users search for data sets, tag them, see which of them are certified and see star ratings for them, provided by other users.
That's where partners and their products come in. Waterline Data and Collibra, each of which offers data catalog and data lineage functionality, are key partners. Cask, whose Data Application Platform (CDAP) provides a unified API over various Big Data components, and specific APIs for metadata inspection and for audit, is a partner as well.
By themselves, each of these products only catalogs what's entered into them. They work as long as everyone uses them (or codes to them, in the case of CDAP). Essentially there's an honor system in place.
The human touch
When combined with the MapR-ES governance bits, things can get more regimented, but only with a thorough implementation. MapR's Quick Start Solution (QSS) for Data Governance includes a professional services component that ensures a customer's implementation is successful. It does that by including the configuration of security and permissions such that data ingestion must take place through MapR-ES, as opposed to implementing that ability as a mere option.
With everything stitched together in this fashion, customers can have data governance over their data lake. Ingestion is forced through MapR-ES where a pre-processor is embedded to capture metadata and lineage information. This means all data on-boarding is actively observed and cataloged. That's the good news.
The bad news is that while this solution does provide for good data governance over the data lake, it is still siloed away from systems implemented on other platforms. For example, OLTP and Data Warehouse systems are governed separately -- and plenty of analysis can be conducted on those databases, outside the purview of the MapR system and its governance facility.
And all this at a time when high-profile data breaches -- whether at retailers, entertainment companies or governments -- are happening constantly. Accordingly, the data regulatory burden, including the impending deadline for compliance with GDPR (the EU's General Data Protection Regulation), is growing immensely, as it should. As much as we yearn for data governance, we are essentially in a period of data anarchy.
Let's be clear here: MapR is doing its fair share. Not only is it making its Converged Data Platform inter-operable with a number of data governance products, but it's providing guidance and even implementation services to make that integration compulsory rather than discretionary.
All for one?
But the industry as a whole needs to do better here. There need to be industry-wide APIs and standards, as well as guidance on how to use them, from the vendors. There also needs to be automation, and lots of it. Manual cataloging and provision of lineage information rely on the full participation of data source owners. The chances of getting that full participation are dubious, and even if best efforts could be assumed, the sheer number of data sources, and the rate of their growth, make manual cataloging unsustainable.
Data lake governance is in a fledgling state, just when it needs to be in a phase of real maturity. This is a crisis. And while vendors are finally paying attention to it, appreciation of its depths is still lacking, as is the level of urgency in getting good solutions to market. Automation and machine learning are desperately needed here, because having data catalogs doesn't work unless they're fully populated, at maximum accuracy.
Disclosure: I work with two companies, Datameer and Io-Tahoe, whose products offer data governance features and functionality.