Nature abhors a vacuum and simplicity abhors best of breed. In an ideal world, there would be an all-encompassing umbrella solution that could meet all of your needs from soup to nuts. You'd have fewer moving parts, fewer integration issues, and most importantly, a single throat to choke. The debate between umbrella and best-of-breed remains as relevant as ever today, especially when it comes to balancing the convenience of using managed services from any of the usual suspects against preserving freedom of choice and avoiding vendor lock-in.
Just look at the debates that are surfacing as enterprises get serious about cloud migration. If you're on AWS, there's the convenience of using Amazon's DynamoDB interchangeably with EMR, plus integration with its Data Pipeline service for tiering data to S3 storage. The flip side of the coin is the question of how dependent your organization wants to become on AWS or any other cloud provider. That's a theme we'll return to later on.
So when we looked at data lake governance, we found that transparency (knowing what data is in the data lake) and security were paramount. But there is no single tool for making your data lake transparent and the data content discoverable. There's little fear of vendor lock-in here. Business teams and IT share responsibility for managing what information is in the data lake. Business teams are responsible for curating their own data, while IT is on the hook for ensuring that the data is secured and governed properly.
Managing the content of your data lake involves multiple tasks. There is profiling and preparing data to make it consumable, and matching and de-duplication for helping validate it. To make the data usable, there's the need to enrich it by blending related data (such as demographic or behavioral data for a customer) and/or the insights of your colleagues on the utility or provenance of the information. And to make it accessible, it makes sense to publish metadata into a catalog. So many tasks, and not so surprisingly, so many tools have emerged. And there's so little time.
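To make those tasks concrete, here is a minimal sketch of that curation chain in Python. All of the names, records, and match rules here are illustrative assumptions, not any vendor's actual implementation: profiling is reduced to null counting, matching to an email key, enrichment to a zip-code lookup, and publishing to a dictionary-shaped catalog entry.

```python
from collections import Counter

# Hypothetical customer records pulled from a data lake (illustrative only).
records = [
    {"id": 1, "name": "Ann Lee", "email": "ann@example.com", "zip": "10001"},
    {"id": 2, "name": "Ann  Lee", "email": "ann@example.com", "zip": "10001"},
    {"id": 3, "name": "Bo Chen", "email": "bo@example.com", "zip": None},
]

def profile(rows):
    """Profiling: count nulls per field to gauge how consumable the data is."""
    nulls = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                nulls[field] += 1
    return dict(nulls)

def dedupe(rows):
    """Matching/de-duplication: treat a shared email as the match key."""
    seen, unique = set(), []
    for row in rows:
        key = row["email"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def enrich(rows, demographics):
    """Enrichment: blend in related (here, demographic) data by zip code."""
    return [{**row, "region": demographics.get(row["zip"], "unknown")}
            for row in rows]

def publish_metadata(rows):
    """Publishing: emit a minimal catalog entry describing the data set."""
    return {"dataset": "customers", "fields": sorted(rows[0]), "rows": len(rows)}

demographics = {"10001": "NYC metro"}
curated = enrich(dedupe(records), demographics)
catalog_entry = publish_metadata(curated)
```

Each step feeds the next, which is exactly why a toolchain tends to accrete: in practice each of these functions is a product category of its own.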
Our knee-jerk reaction is that a toolchain of four or five tools for performing these tasks won't be sustainable. But that assumes you're working against a single, monolithic target. Reality is rarely so black and white. Just as the world moved on from the notion of a single galactic enterprise data warehouse providing the single source of truth around which analytics and satellite data marts thrived, so too has it moved on from the notion that the data lake would live in a single Hadoop cluster. Chances are, your data lake is the universe of data stores sitting across your enterprise, whether that be your enterprise data warehouse, Oracle database, Hadoop cluster, and/or BI tool cache. Maybe that imaginary single-purpose Swiss army knife of a data inventory tool wouldn't suffice after all.
Alation is one of that new wave of tools for helping the business make sense of what data is in the lake and how to query it. Last week, it secured $23 million in Series B funding, which will primarily focus on expanding its channels to market.
Like many of these tools, Alation fuses machine learning and crowdsourcing to perform its magic. For Alation, it's about cataloging the content of your data lake through crawling enterprise databases for harvesting metadata; tracking usage patterns for providing query recommendations; and offering natural language search for identifying tables.
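The metadata-crawling piece of that recipe can be sketched with nothing but a database's own system catalog. This is an assumption-laden toy, not Alation's actual crawler: it uses SQLite's `sqlite_master` and `PRAGMA table_info` as stand-ins for enterprise database catalogs, and a naive keyword match as a stand-in for natural language search and usage-based ranking.

```python
import sqlite3

# Two illustrative tables standing in for enterprise databases to be crawled.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
""")

def harvest(conn):
    """Crawl the system catalog to collect table and column metadata."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = [col[1] for col in cols]  # col[1] is the column name
    return catalog

def search(catalog, term):
    """Naive keyword search over harvested names, a placeholder for NL search."""
    term = term.lower()
    return [t for t, cols in catalog.items()
            if term in t.lower() or any(term in c.lower() for c in cols)]

catalog = harvest(conn)
```

A query like `search(catalog, "email")` surfaces the `customers` table; the real products layer usage statistics and machine learning on top of this kind of harvested metadata to rank and recommend.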
Alation is hardly the only player providing a catalog, but most of its rivals incorporate it as part of broader offerings. In the Hadoop world, Cloudera Navigator includes cataloging as part of a broader data governance framework. Zaloni incorporates a data catalog as part of a bundle that manages and governs the populating of data lakes.
Providers like IBM and Collibra also offer catalogs as the byproduct of information stewardship approaches encompassing business glossaries, data dictionaries, policy managers, and master data-like reference data. But IBM's catalog (and data lake governance) capabilities are now being rethought in light of the new OEM relationship with Hortonworks, which brings in the Apache Atlas technology for tagging metadata. And you can get cataloging as an extension of the data preparation capabilities provided by the likes of Paxata.
Functionally, just about the only direct competition is Waterline Data, which has focused on a mix of machine learning and human curation to identify the provenance of data. But that does not extend to the assistance that Alation provides for actually querying the data.
So Alation's challenge is proving it's more than just a product feature. To its credit, it has been successful in cultivating an OEM agreement with Teradata and a unique integration with Trifacta where users of each tool can toggle back and forth between cataloging and data prep. Since the Trifacta announcement went live late last year, both have lined up a handful of joint customers who are now putting the linked solution into production. Although both are positioned as self-service tools, in practice, data prep will likely be the domain of more technically savvy users or data engineers. So whether the data folks should prep data before the business folks catalog it, or vice versa, will become the chicken-and-egg question for exploring the data lake.