Hortonworks today is introducing Dataplane Service (DPS), a new cloud-based service for governing virtual data lakes. DPS is what Hortonworks terms a data "fabric." As we find that term a bit imprecise, we'll characterize DPS as a service that connects security- and governance-related services and presents them to the data lake administrator as a catalog of catalogs. DPS won't replace third-party catalogs used for data discovery, but will work alongside them. It will sit in the cloud, providing a place to "register" data sources and build a catalog of metadata for managing data services. It builds on existing open source projects such as Apache Ranger, where security policies are created and enforced, and Atlas, which manages the metadata.
While DPS will be used for controlling clusters and managing governance and security, it isn't a cluster management, security, or governance tool per se. Instead, it is designed to plug in external services with published APIs. Hortonworks' goal is getting third parties to gin up such services. With the rollout, Hortonworks is including a Data Lifecycle Manager that provides information lifecycle management (ILM) capabilities, encompassing replication, disaster recovery, backup and restore, and automated tiering of hot and cold data to different classes of storage. On the roadmap, Hortonworks plans additional services for security and deployment.
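To make the tiering idea concrete, here is a minimal sketch of how a lifecycle job might classify datasets as hot or cold by last access time. The names and the 90-day threshold are ours for illustration, not actual DPS or Data Lifecycle Manager APIs.

```python
from datetime import datetime, timedelta

# Hypothetical policy: datasets untouched for more than `cold_after_days`
# are candidates for migration to cheaper storage. Illustrative only.
COLD_AFTER_DAYS = 90

def assign_tier(last_accessed: datetime, now: datetime,
                cold_after_days: int = COLD_AFTER_DAYS) -> str:
    """Classify a dataset as 'hot' or 'cold' based on last access time."""
    if now - last_accessed > timedelta(days=cold_after_days):
        return "cold"
    return "hot"

# Example: plan which datasets a lifecycle job would move.
datasets = {
    "clickstream_2017q1": datetime(2017, 1, 15),
    "clickstream_current": datetime(2017, 9, 20),
}
now = datetime(2017, 9, 25)
plan = {name: assign_tier(ts, now) for name, ts in datasets.items()}
# plan: {'clickstream_2017q1': 'cold', 'clickstream_current': 'hot'}
```

The point of a pluggable design is that the policy logic, not just the threshold, could come from a third-party service registered with DPS.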
The need for DPS stems from the realization that data lakes will be virtual. With new mandates such as the EU General Data Protection Regulation (GDPR) placing strong requirements on data privacy, in many cases, enterprises will have to keep consumer data within the country of origin. Translation: if you're a multi-national company, privacy laws will require you to maintain multiple physical data lakes.
The same applies to cloud strategy. We expect that few large organizations will put all of their data in the cloud, or in any single cloud. Internal policies or public regulations may compel organizations otherwise bent on cloud deployment to maintain some data on premises. And for organizations looking to move a critical mass of their data and applications to the cloud, we expect that they will require, at minimum, second sources. Again, that will translate to multiple data lakes.
Then there is the fact of life that Hadoop won't sit on an island, but will coexist with data warehouses, NoSQL operational stores, IoT platforms, and streaming systems. That's where life gets interesting, because databases are overlapping in capability. SQL databases are adding JSON capabilities and vice versa, and they are federating queries to Hadoop. Meanwhile, analytic tools aim to own the query, regardless of where it runs. Then there are the Informaticas and IBMs of the world with their portfolios of data integration tools, plus the newer generation of data preparation, cataloging, and data lake management tools.
DPS is a timely addition to the Hortonworks portfolio because it recognizes that organizations building data lakes are likely to be managing multiple instances and will require a measure of coherence across them. DPS does not replace the governance tools that are already part of the Hortonworks platform, including Ranger, where security policies are set; Knox, which acts as a gateway to enterprise directories for authenticating users; Atlas, for tagging data entities for governance; and Falcon, which is used for specifying data workflows.
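To illustrate what the policy side of this stack looks like, here is a simplified sketch of a Ranger-style access policy and how a service might evaluate it. The shape loosely follows Apache Ranger's policy model (resources plus policy items granting access types to users), but the field names are simplified and are not the exact Ranger REST schema.

```python
# Simplified, Ranger-style policy: which users may perform which
# access types on a given Hive database/table. Field names are
# illustrative, not the literal Apache Ranger JSON schema.
policy = {
    "service": "hive_prod",
    "resources": {"database": ["sales"], "table": ["orders"]},
    "policyItems": [
        {"users": ["analyst"], "accesses": ["select"]},
        {"users": ["etl_svc"], "accesses": ["select", "update"]},
    ],
}

def is_allowed(policy: dict, user: str, access: str) -> bool:
    """Return True if any policy item grants `access` to `user`."""
    return any(
        user in item["users"] and access in item["accesses"]
        for item in policy["policyItems"]
    )

# is_allowed(policy, "analyst", "select")  -> True
# is_allowed(policy, "analyst", "update")  -> False
```

The coherence problem DPS targets is exactly this: when an organization runs several data lakes, policies like the one above must be kept consistent across all of them, not set cluster by cluster.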
The devil is in the details, because cloud providers, database providers, data integration tools, and analytic tool providers all view themselves as the center of the world as they plant their stakes in governance. And then there is the case of cloud providers that view security, identity, and access control as core pillars of their service. Hortonworks' challenge with DPS will be the degree to which it can play good neighbor to each of these systems, each of which has its own governance ambitions.