Streamlining data science with open source: Data version control and continuous machine learning

Can an open source-based workflow leveraging version control and continuous integration and deployment help streamline machine learning, like it did for software development?

MLOps, short for machine learning operations, is the equivalent of DevOps for machine learning models: taking them from development to production, and managing their lifecycle in terms of improvements, fixes, redeployments, and so on.

The difficulty of achieving MLOps nirvana is a major barrier to getting value out of machine learning and data science. Version control systems like Git and practices like continuous integration / continuous deployment (CI/CD) have helped operationalize software development.

What if those systems and practices could also be used for MLOps? Iterative.ai wants to address this question with open source projects Data Version Control and Continuous Machine Learning.

Bringing version control to machine learning

Data engineers and machine learning and data science practitioners work with a wide range of data. They need a workflow, and tools to support it, to keep track of their artifacts and their versions, resolve issues, and collaborate across teams and systems.

Iterative.ai is an MLOps company dedicated to streamlining the workflow of data scientists. Today it announced the latest releases of its open source projects, Data Version Control (DVC) and Continuous Machine Learning (CML).

Iterative.ai claims DVC and CML remove the need for proprietary AI platforms by extending traditional software tools like Git and CI/CD to meet the needs of machine learning engineers. ZDNet connected with Dmitry Petrov, CEO and founder of Iterative.ai, to find out more about DVC and CML.

CML is an open source project that aims to help facilitate the machine learning workflow

The goal of DVC is to bring agility, reproducibility, and collaboration into existing data science workflows. DVC provides users with a Git-like interface for versioning data and models, bringing version control to machine learning to address the challenges of reproducibility.

DVC is built on top of Git. Instead of storing large files in Git, it creates lightweight metafiles that Git can track, while the large files themselves are handled by DVC. It works with remote storage for large files, either in the cloud or on on-premises network storage.
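To make the metafile idea concrete, here is a minimal Python sketch of the general technique: hash a large file, move its contents into a content-addressed cache, and leave behind a tiny pointer file that is small enough to commit to Git. This is an illustration of the concept only, not DVC's actual implementation. The function names, the `.meta.json` suffix and the JSON layout are all assumptions made up for this example.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track_large_file(data_path: Path, cache_dir: Path) -> Path:
    """Copy a file into a content-addressed cache and write a small pointer file.

    Hypothetical helper: the pointer format below is invented for this sketch.
    """
    digest = file_md5(data_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_path, cached)
    pointer = data_path.with_name(data_path.name + ".meta.json")
    meta = {"md5": digest, "size": data_path.stat().st_size, "path": data_path.name}
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore_large_file(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the large file next to its pointer, using the cached copy."""
    meta = json.loads(pointer.read_text())
    target = pointer.parent / meta["path"]
    shutil.copy2(cache_dir / meta["md5"], target)
    return target
```

In this sketch only the pointer file would go into Git, while the cache directory stands in for the cloud or on-premises storage that holds the actual data.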

CML is an open-source library for implementing continuous integration and delivery (CI/CD) in machine learning projects. Users can automate parts of their development workflow, including model training and evaluation, comparing machine learning experiments across their project history, and monitoring changing datasets. CML will also auto-generate reports with metrics and plots in each Git pull request.
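As a rough illustration of what a metrics report attached to a pull request might contain, the following Python sketch renders a Markdown table comparing a candidate model against a baseline. It is a hypothetical example, not CML's own code or API: the function name, the metric names and the table layout are assumptions, and posting the report to the pull request is left to the CI tooling.

```python
from typing import Mapping


def render_metrics_report(
    baseline: Mapping[str, float], candidate: Mapping[str, float]
) -> str:
    """Build a Markdown table comparing candidate metrics with a baseline.

    Hypothetical helper for illustration; metrics missing from the baseline
    are shown as "n/a" instead of being guessed.
    """
    lines = [
        "## Model metrics",
        "",
        "| metric | baseline | candidate | change |",
        "|---|---|---|---|",
    ]
    for name in sorted(candidate):
        new = candidate[name]
        old = baseline.get(name)
        if old is None:
            lines.append(f"| {name} | n/a | {new:.4f} | n/a |")
        else:
            lines.append(f"| {name} | {old:.4f} | {new:.4f} | {new - old:+.4f} |")
    return "\n".join(lines)
```

A CI job could call such a function after training, for example `render_metrics_report({"accuracy": 0.90}, {"accuracy": 0.92})`, and hand the resulting text to whatever step comments on the pull request.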

Projects and products

That sounds almost too good to be true: fully open source projects that deliver that kind of functionality and value? Great, but what's the catch, and for whom? Are the projects really open source, or maybe open core -- i.e., are there proprietary parts? And what is Iterative.ai's business model?

A hosted service (SaaS offering) for DVC and CML looks improbable at first blush. As Petrov noted, there is no such thing as hosted DVC or CML, because they are distributed and on-premises by design, like Git or Terraform. The business model, Petrov went on to add, is similar to HashiCorp's:

"We build open-source tools and give them to practitioners for free. We build DVC and CML while HashiCorp builds Terraform, Vault, and others. Monetization comes from enterprise scenarios (better data access control, security, integrations, team collaboration, etc). Those are separate products on top of DVC and CML."

DVC is an open source project that aims to help data engineers and machine learning practitioners use version control for their projects

The other thing that struck us about the combination of DVC and CML is that they seem to pack a lot of functionality, which is actually quite complex. Most software developers, for example, don't use Git through the command line, but rather via IDEs: visual tools for software development that integrate version control on top of Git.

It turns out there is an analogy here. Iterative.ai also offers DVC-Studio, which packs UI and collaboration features on top of DVC and CML. Petrov likened this to Git + GitHub. DVC-Studio is not open source, and not officially released yet either:

"Today people use DVC and CML as-is, and it's mostly a command-line experience. Without Studio, these two are still functional. Like Git and GitHub - you don't need GitHub or GitLab to use Git, but it is nice to have," said Petrov.

From a community to the enterprise

How many people actually use DVC and CML as-is today? Quite a lot, it would seem. Iterative.ai counts 400+ companies, 4,000+ community members, plus 200+ contributors and 7,000+ GitHub stars. Petrov also mentioned an additional 2,000+ users for DVC.

Petrov, who holds a Ph.D. in computer science, is a data scientist himself, previously at Microsoft (Bing). DVC was his pet project when he started it in 2017, before he incorporated Iterative.ai with co-founder and ex-colleague Ivan Shcheklein.

As for today's announcement, Petrov highlighted lightweight machine learning experiments as the major feature in DVC 2.0. DVC is great for making machine learning projects reproducible, but it creates some overhead, as a Git commit is needed for each step or experiment.

Iterative.ai's product offering, based on DVC and CML

DVC 2.0 simplifies and automates this experience. Machine learning experiments can now be created with a single command and be fully reproducible, Petrov said. Another step toward easier experimentation is support for machine learning model checkpoints and live metrics or logs.

Both matter in deep learning scenarios, where you need to track the training process and may want to use not the latest model but one of the earlier ones (checkpoints), Petrov added.
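Why checkpoints matter can be shown with a small Python sketch: if the validation loss starts rising again late in training, the model worth keeping is an earlier checkpoint, not the last one. This is a generic illustration rather than DVC's checkpoint feature. The `Checkpoint` class, the `best_checkpoint` helper and the choice of validation loss as the selection metric are assumptions for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Checkpoint:
    """A saved training state, identified here only by epoch and metric."""
    epoch: int
    val_loss: float


def best_checkpoint(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Return the checkpoint with the lowest validation loss.

    Hypothetical helper: on a tie, the earliest checkpoint wins.
    """
    if not checkpoints:
        raise ValueError("no checkpoints recorded")
    return min(checkpoints, key=lambda checkpoint: checkpoint.val_loss)
```

With a history such as `[Checkpoint(1, 0.90), Checkpoint(2, 0.50), Checkpoint(3, 0.62)]`, the helper picks epoch 2 even though epoch 3 is the latest, which is the situation Petrov describes.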

Today, adoption of DVC and CML is purely bottom-up and community-driven. Although we do not have more details on specific enterprise use cases or Iterative.ai's venture backing at this point, Petrov mentioned that plans include growing the current headcount of 19 to 30+ in 2021.

DVC and CML seem like a reasonable idea, and adoption looks promising. It's worth keeping an eye on the projects, as well as Iterative.ai, to see how traction translates to enterprise use and sustainability.