Cloudera Machine Learning release takes cloud-native path

By updating its Data Science Workbench with a new edition that runs on Kubernetes clusters, is Cloudera opening the door to a broader cloud-native transformation of its product portfolio?


On the heels of its last quarterly report before the expected closing of its merger with Hortonworks, Cloudera has announced a preview of a new cloud-native counterpart to its Cloudera Data Science Workbench (DSW) that goes full tilt on Kubernetes. Significantly, it carries a different branding -- Cloudera Machine Learning (Cloudera ML).

The architecture and the branding reflect two shifts in the market. The first is the move to the cloud. While we estimate that only about 25% to 30% of Cloudera's installed base is running workloads in the cloud, the velocity toward cloud adoption is unmistakable. Ovum has predicted that next year, half of new big data workloads will run in the cloud. And that dictates supporting the kind of autoscaling that is possible in the cloud.

The second trend is AI, or more specifically, machine learning. When Cloudera initially released DSW, the brunt of activity focused on building conventional data science models that are static -- they are deployed, and then any changes to the models are made by people.

Today, to say that there is interest in AI (mostly the machine learning form) would be an understatement. The move to adopting AI reflects the fact that models, frameworks, and compute are more accessible than ever -- thanks both to dedicated cloud services and to the availability of GPU resources that, consumed through the cloud, won't force enterprises to blow their next three years of capital budgets on AI compute.

And, given the availability of dedicated services like Databricks (for Spark workloads), Amazon SageMaker, Azure Machine Learning, and Google Cloud AutoML, there are alternatives to Hadoop for running machine learning workloads.

You can certainly use DSW for AI problems, but the challenge is in economically managing compute. So, Cloudera augmented the DSW offering with an additional one: Cloudera ML. It responds to these trends with a new Kubernetes-based architecture that bypasses the YARN resource scheduling of on-premise Hadoop clusters. To be clear, this doesn't replace the existing DSW that runs on Hadoop and YARN; rather, it provides another edition that works in Kubernetes environments.

This is not the first time that Cloudera has supported containers for data science or ML workloads; by using containers, Cloudera could package the interdependencies needed for physical deployment. But given that the original DSW was targeted at Cloudera Enterprise customers running Hadoop clusters, it ran Spark workloads under YARN to fit into the same deployment.

The cloud is a different story. First off, the data lake typically lives in cloud object stores, not HDFS. Secondly, Cloudera CDH (using YARN) does not support out-of-the-box autoscaling -- the ability to ramp compute capacity up and down -- because it was designed to operate on clusters where data and compute sat on the same nodes. With Kubernetes becoming the de facto standard for cloud-native compute (even AWS, which had its own proprietary container management services, has bitten the bullet and begun offering a managed Kubernetes service), the die was cast for Cloudera. If it wanted to support customers in the cloud, DSW or its successor would have to embrace Kubernetes, not YARN.
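To make the contrast concrete: in Kubernetes, autoscaling is a built-in primitive rather than a bolt-on. The sketch below is generic Kubernetes, not Cloudera's actual configuration, and the workload name is hypothetical -- it simply shows how a containerized ML serving workload could be told to scale between two and ten replicas based on CPU load:

```yaml
# Hypothetical example: a HorizontalPodAutoscaler that scales a
# containerized model-serving Deployment on CPU utilization.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: ml-serving          # hypothetical workload name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-serving        # the Deployment being scaled
  minReplicas: 2            # floor when demand is low
  maxReplicas: 10           # ceiling when demand spikes
  targetCPUUtilizationPercentage: 80
```

There is no equivalent declarative knob in a YARN-scheduled CDH cluster, where capacity is bound to the fixed set of nodes holding the data.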

Cloudera ML is now in limited private preview, supporting access to data in cloud object stores, HDFS, and external databases, with deployment in the public cloud or, eventually, on premises (in private clouds) via OpenShift.

Broader questions

While Cloudera ML is the company's first release of a 100% Kubernetes-based product, we don't view this as an isolated foray or outlier. In the background, the Apache Hadoop community has embarked on decoupling Hadoop from HDFS so that cloud object storage will also be a first-class citizen. With Hadoop no longer the only place for running big data, or specifically, ML workloads, we wouldn't be surprised if at some point, Cloudera unleashes Cloudera ML for running on any Kubernetes cluster, on-premises or in the public cloud.

And that's where some broader questions come in.

Clearly, Cloudera is going to continue supporting on-premise, which is the core of its current installed base. As an on-premise vendor that is extending toward the cloud, it will increasingly differentiate itself through its support of hybrid. But supporting hybrid means adding cloud-native options, just as it is now doing by augmenting its DSW product line with Cloudera ML. So, what about other workloads like data engineering or data warehousing? In the cloud, those could also benefit from running on Kubernetes clusters.

And that once more leads to the perennial question of what makes Hadoop, Hadoop. Recall that there are efforts underway to make the Hadoop platform more cloud-friendly, from decoupling storage to accommodating containerized workloads. These are long-term initiatives underway in the Apache community. So, once you supplant HDFS with cloud object storage, and MapReduce with Spark, what are you left with? That's where governance, management, and support of multiple types of workloads will differentiate Hadoop from big data point services. Whether the resources are dictated by YARN or Kubernetes will become an academic question. It's not even 2019 yet, but we'll still make this prediction: In the future, the kind of Hadoop you run will be based on how you deploy it.