Cloudera's MLOps platform brings governance and management to data science pipelines

Cloudera today announced that it is adding machine learning-specific operations features to its Cloudera Data Platform. Beyond the news itself, we offer some details on the functionality and features.

Cloudera took a big step forward with its Cloudera Machine Learning (CML) platform today. The company is introducing new operational management features for machine learning models and governance features for the data science pipelines that produce them. See ZDNet Editor in Chief Lawrence Dignan's post for coverage of the news itself, and some really helpful analysis of how it positions Cloudera in the analytics market. To augment Dignan's analysis, I'll cover details on the machine learning operations (MLOps) features Cloudera is announcing today. And before doing so, I'll explain why customers need them to begin with.

Wherefore MLOps?

To understand why MLOps is necessary, consider that machine learning models are actually software. Typically, the models are deployed as REST-based Web services and they go through a development process involving the authoring of code. In addition to the software development parallels, machine learning also involves the use and processing of data sets, just as BI and other descriptive analytics work does.

For these very reasons, machine learning work should be supported by the same kind of source code management, testing, versioning and automated deployment that other software has. Similarly, data science environments need data governance support, including cataloging and lineage tracking of machine learning models and their underlying data sets. Cloudera's MLOps offering addresses both: model deployment and management features surface inside CML, while governance features show up in Cloudera's Shared Data Experience (SDX) fabric.

Atlas embraced

The governance features come to SDX as enhancements to the open source Apache Atlas project, which Cloudera announced in December. Though Atlas is an industry-wide standard, Cloudera is its chief backer and the project was founded by Hortonworks, which merged with Cloudera in a deal announced back in October 2018. Cloudera Data Catalog also has a basis in Apache Atlas.

Machine learning governance features in SDX include the aforementioned model cataloging and lineage capabilities. SDX also provides security infrastructure over the REST Web service interfaces erected around deployed models. 

Management and administration

Management features in CML include automated deployment support as well as a model monitoring service for tracking the overall performance, accuracy and drift of a model. CML can also track individual predictions made by the model, and how well they correspond to "ground truth", ensuring compliance and providing detailed context for assessing the model's overall accuracy. To manage and ensure the interpretability of machine learning models, CML offers built-in functionality to generate SHAP- and LIME-based model and prediction explanations.
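
To make the idea of tracking individual predictions against ground truth more concrete, here is a minimal sketch in plain Python. It is not CML's API: the PredictionTracker class and its method names are hypothetical, the model is assumed to emit a score between 0 and 1 for a binary label, and the drift measure (the shift in mean score between a baseline window and later predictions) is just one simple choice among many.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class PredictionTracker:
    """Hypothetical in-memory monitoring store (illustration only, not CML's API)."""

    threshold: float = 0.5  # score at or above this counts as a positive prediction
    scores: Dict[str, float] = field(default_factory=dict)
    labels: Dict[str, int] = field(default_factory=dict)
    order: List[str] = field(default_factory=list)  # prediction IDs in arrival order

    def record_prediction(self, prediction_id: str, score: float) -> None:
        """Store the score the model produced for one request."""
        if prediction_id in self.scores:
            raise ValueError(f"duplicate prediction id: {prediction_id}")
        self.scores[prediction_id] = score
        self.order.append(prediction_id)

    def record_ground_truth(self, prediction_id: str, label: int) -> None:
        """Attach the true outcome once it becomes known, possibly much later."""
        if prediction_id not in self.scores:
            raise KeyError(f"unknown prediction id: {prediction_id}")
        self.labels[prediction_id] = label

    def accuracy(self) -> Optional[float]:
        """Fraction of labeled predictions that matched ground truth."""
        labeled = [pid for pid in self.order if pid in self.labels]
        if not labeled:
            return None
        hits = sum(
            1
            for pid in labeled
            if int(self.scores[pid] >= self.threshold) == self.labels[pid]
        )
        return hits / len(labeled)

    def mean_shift(self, baseline_size: int) -> Optional[float]:
        """Mean score of later predictions minus mean score of the baseline window."""
        if baseline_size <= 0 or len(self.order) <= baseline_size:
            return None
        baseline = [self.scores[pid] for pid in self.order[:baseline_size]]
        recent = [self.scores[pid] for pid in self.order[baseline_size:]]
        return mean(recent) - mean(baseline)
```

A caller would record each score as it is served, attach labels as real outcomes arrive, and then poll accuracy() and mean_shift() on a schedule. A production service would persist these records and use a more robust drift statistic.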

On the development side, CML is organized around template-based projects, which consist of associated source code files, development sessions (configurable Kubernetes containers), experiments, models and jobs. As those projects progress, developers can embed API calls to CML within their source code to log experiments and their associated metadata and metrics.
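
The pattern of logging an experiment from inside training code can be sketched as follows. This is an illustration only: ExperimentRun, log_param, log_metric and the experiment context manager are hypothetical names rather than CML's actual API, and the in-memory record stands in for whatever backing store a real platform would use.

```python
import json
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List, Optional


class ExperimentRun:
    """Hypothetical record of one experiment run (illustration only, not CML's API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.params: Dict[str, Any] = {}
        self.metrics: Dict[str, List[float]] = {}
        self.started_at: float = time.time()
        self.finished_at: Optional[float] = None

    def log_param(self, key: str, value: Any) -> None:
        """Record a piece of metadata, such as a hyperparameter."""
        self.params[key] = value

    def log_metric(self, key: str, value: float) -> None:
        """Append one observation of a metric, keeping its full history."""
        self.metrics.setdefault(key, []).append(float(value))

    def to_json(self) -> str:
        """Serialize the run so it could be shipped to a tracking service."""
        return json.dumps(
            {
                "name": self.name,
                "params": self.params,
                "metrics": self.metrics,
                "started_at": self.started_at,
                "finished_at": self.finished_at,
            },
            sort_keys=True,
        )


@contextmanager
def experiment(name: str) -> Iterator[ExperimentRun]:
    """Open a run and stamp its end time even if training raises."""
    run = ExperimentRun(name)
    try:
        yield run
    finally:
        run.finished_at = time.time()


def train_toy_model(learning_rate: float, epochs: int) -> ExperimentRun:
    """Stand-in training loop whose 'loss' simply decays each epoch."""
    with experiment("toy-regression") as run:
        run.log_param("learning_rate", learning_rate)
        run.log_param("epochs", epochs)
        loss = 1.0
        for _ in range(epochs):
            loss *= 1.0 - learning_rate  # placeholder for a real training step
            run.log_metric("loss", loss)
    return run
```

Calling train_toy_model(0.5, 3) returns a run whose parameters and per-epoch loss history can be inspected or serialized with to_json(). The point of the pattern is that metadata and metrics are captured by the same code that produces the model, so each run is reproducible and comparable with the others.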

Open platform, hyper/multi cloud

In an advance briefing with ZDNet, Cloudera explained that given the governance features' basis in Apache Atlas, and CML being a component of the Cloudera Data Platform (CDP), Cloudera's MLOps capabilities are in fact open standards which the company hopes will see adoption by other industry players. Moreover, since CDP supports, and SDX manages, deployments across private and (potentially multiple) public clouds, the CML environment is portable across target platforms too.

Also read: Cloudera Data Platform launches with multi/hybrid cloud savvy and mitigated Hadoop complexity

Cloudera explained to ZDNet that among its customers are organizations that have progressed well past the evaluation phase of machine learning work and have tens, hundreds or even thousands of models in production. Managing these models on an ad hoc basis, without structured development tooling to produce them, is simply unsustainable. Necessity being the proverbial mother of invention, Cloudera MLOps is the company's concrete response to the needs of those customers.

Now we'll see how Cloudera's customer-requirements-driven MLOps offering fares against MLOps platforms from pure-play startups such as Datatron, Algorithmia and DotScience.

Cloudera is a customer of Brust's advisory firm, Blue Badge Insights.