Hortonworks unveils roadmap to make Hadoop cloud-native
Acknowledging the importance of cloud, Hortonworks is releasing a roadmap and partnering with IBM and Red Hat to transform Hadoop into a cloud-native platform. It's a journey that won't happen overnight.
It would be pure understatement to say that the world has changed since Hadoop debuted just over a decade ago. Rewind the tape five to 10 years, and if you wanted to work with big data, Hadoop was pretty much the only platform game in town. Open source software was the icing on the cake of cheap compute and storage infrastructure that made processing and storing petabytes of data thinkable.
Since then, storage and compute have continued to get cheaper. But so has bandwidth, as 10 GbE connections have supplanted the 1 GbE connections that were the norm a decade ago. The cloud, edge computing, smart devices, and the Internet of Things have changed the big data landscape, while dedicated Spark and AI services offer alternatives to firing up full Hadoop clusters. Capping it off, as we previously noted, cloud storage has become the de facto data lake.
Today you can run Hadoop in the cloud, but it is not yet a platform that fully exploits the cloud's capabilities. Aside from slotting in S3 or other cloud storage in place of HDFS, Hadoop does not take full advantage of cloud architecture. Making Hadoop cloud-native is not a matter of buzzword compliance, but of making it more fleet-footed.
The need for Hadoop to get there is not simply attributable to competition from other bespoke big data cloud services, but to the inevitability of cloud deployment. In addition to cloud-based Hadoop services from the usual suspects, we estimate that about 25% of workloads from Hadoop incumbents -- Cloudera, Hortonworks, and MapR -- are currently running in the cloud. But more importantly, by next year, we predict that half of all new big data workloads will be deployed in the cloud.
So what's it like to work with Hadoop in the cloud today? It can often take 20 minutes or more to provision a cluster with all of its components. That runs counter to the expectation of being able to fire up a Spark or machine learning service within minutes -- or less. That is where containerization and microservices come in: they can isolate workloads or entire clusters, making multi-tenancy real, and they can make launching Hadoop workloads far more efficient.
Another key concept for cloud operation is separating compute from storage. That flies in the face of Hadoop's original design pattern, where the idea was to bring compute to the data to minimize data movement. Today, the pipes have grown fat enough to make that almost a non-issue. Separating compute and storage is already standard practice with most managed cloud-based Hadoop services, although in EMR, Amazon does provide the option of running HDFS.
Step 1 of the initiative will address containerization. Getting there won't be trivial. It's one thing to accept containerized workloads, but it's another to rearchitect all the components of Hadoop as containers, both at the cluster and at the edge. And once the Apache community gets to critical mass in refactoring Hadoop components into containers, there's the need to provide migration paths to the installed base.
Beyond containers, Hortonworks envisions the roadmap encompassing the separation of compute from data. That's step 2. To some extent, that's already de facto reality, as each of the major cloud providers' managed Hadoop services already does it: they use their cloud object stores as in-kind replacements for HDFS and keep compute separate (although, as noted, Amazon offers the option of running EMR with local HDFS storage). But connectors, like S3A for S3, are not optimal, and you can't simply swap out HDFS for object storage if you're running your own private cloud.
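To make the HDFS-versus-object-store swap concrete, here is a minimal sketch of how Hadoop's S3A connector lets jobs address S3 with the same URI-style paths they would use for HDFS. The namenode host and bucket name are hypothetical placeholders, and this is generic Hadoop usage, not part of the Hortonworks announcement:

```shell
# Copy a dataset from a cluster's HDFS into S3 via the S3A connector.
# (hdfs-namenode and my-data-lake are hypothetical placeholders.)
hadoop distcp hdfs://hdfs-namenode:8020/warehouse/events \
    s3a://my-data-lake/warehouse/events

# Downstream jobs can then address the object store directly:
hadoop fs -ls s3a://my-data-lake/warehouse/events
```

In practice, credentials and endpoints are supplied through properties such as fs.s3a.access.key in core-site.xml. The catch, as noted above, is that the connector is not optimal: object stores lack HDFS semantics such as atomic rename, which is part of what the roadmap aims to address.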
This step will leverage work on the Ozone project, which aims to make HDFS look like a cloud object store. While we're tempted to say that Ozone is an idea that has floated in the ozone for a while, Hortonworks plans to ramp up the effort in one of the next stages of the project. The other element is new APIs that decouple HDFS from compute, so on-premises customers can physically lay out their clusters as private clouds. These pieces won't fall into place until next year at the earliest.
Step 3 involves support for Kubernetes. In the short term, Hortonworks is getting HDP, HDF, and DataPlane Services (DPS) certified on Red Hat's OpenShift Kubernetes container application platform. IBM, which OEMs HDP, is following suit with its IBM Cloud Private for Data platform. While OpenShift addresses private cloud, the open question is support from each of the cloud providers' Kubernetes platforms.
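For a sense of what Kubernetes support looks like at the workload level today, Spark (as of version 2.3) can already submit jobs directly to a Kubernetes cluster. The API server address and container image below are hypothetical placeholders, and this is generic Spark-on-Kubernetes usage rather than anything specific to Hortonworks' certification work:

```shell
# Submit a Spark job to a Kubernetes cluster (Spark 2.3+).
# The API server URL and image name are placeholders.
spark-submit \
    --master k8s://https://k8s-apiserver.example.com:6443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=example/spark:2.3.0 \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

Running one engine this way is the easy part; rearchitecting the rest of the Hadoop stack to schedule on Kubernetes is the harder problem the roadmap describes.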
Beyond these three phases, Hortonworks views cloud-native Hadoop as requiring governance that spans clouds and on-premises data centers. That's a checkbox it is beginning to fill with the DPS framework. A work in progress, DPS is a sort of uber-catalog of services that is gradually being populated with plug-ins -- such as Data Steward Studio, Data Lifecycle Manager, and more recently, Streams Messaging Manager -- for governing replication, access control, and data flows across cloud and hybrid targets. There are also pieces of Atlas, Ranger, and Knox that will need to be adapted for hybrid and multi-cloud governance.
There will be many moving parts to making Hadoop cloud-native. Today, Hortonworks has unveiled the blueprint, but there are still blank spaces to be filled, like baking Kubernetes support into the Hadoop trunk. The Apache community has not yet committed to when that will happen. Making Hadoop cloud-native will be a journey.