Apache Spark creators set out to standardize distributed machine learning training, execution, and deployment
Matei Zaharia, Apache Spark co-creator and Databricks CTO, talks about adoption patterns, data engineering and data science, using and extending standards, and the next wave of innovation in machine learning: Distribution.
Not coincidentally, last week was also when Spark and AI Summit Europe, the European incarnation of the summit, took place. Its title this year has been expanded to include AI, attracting a lot of attention in the ML community. Apparently, it also works as a date around which ML announcements are scheduled.
MLFlow is Databricks' own creation. Databricks is the commercial entity founded by the original creators of Apache Spark, so having MLFlow's new edition announced in Databricks CTO Matei Zaharia's keynote was expected. ZDNet caught up with Zaharia to discuss everything from adoption patterns and use cases to competition, programming languages, and the future of machine learning.
Databricks' motto is "unified analytics." As Databricks CEO Ali Ghodsi noted in his keynote, the goal is to unify data, engineering, and people, tearing down technology and organizational silos. This is a broad vision, and Databricks is not the first one to embark on this journey.
Focusing on the technology part, it's all about bringing together data engineering and data science. As Zaharia noted, everyone begins with data engineering:
"In about 80 percent of use cases, people's end goal is to do data science or machine learning. But to do this, you need to have a pipeline that can reliably gather data over time.
Both are important, but you need the data engineering to do the rest. We target users with large volumes, which is more challenging. If you are using Spark to do distributed processing, it means you have lots of data."
More often than not, it also means that your data is coming from a number of sources. Spark, as well as Delta, Databricks' proprietary cloud platform built on Spark, already supports reading from and writing to a number of data sources. The ability to use Spark as a processing hub for different data sources has been key to its success.
Now, Databricks wants to take this one step further, by unifying different machine learning frameworks from the lab to production via MLFlow, and by building a common framework for data and execution via Project Hydrogen.
MLFlow's goal is to help track experiments, share and reuse projects, and productionize models. It can be seen as combining data science notebooks with history and versioning features found in systems like Git, plus the dependency management and deployment features found in the likes of Maven and Gradle.
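To make the experiment-tracking part concrete, here is a minimal, standard-library-only sketch of the kind of bookkeeping such a tracker automates: recording parameters and metrics per run, then querying for the best run. The class and method names are illustrative, not MLFlow's actual API.

```python
import json
import os
import tempfile
import uuid

class TrackingStore:
    """Toy experiment tracker: records params/metrics per run as JSON files."""

    def __init__(self, root):
        self.root = root

    def log_run(self, params, metrics):
        # Each run gets a unique ID and is persisted so it can be compared later.
        run_id = uuid.uuid4().hex
        with open(os.path.join(self.root, run_id + ".json"), "w") as f:
            json.dump({"run_id": run_id, "params": params, "metrics": metrics}, f)
        return run_id

    def best_run(self, metric):
        # Load every recorded run and return the one with the highest metric.
        runs = []
        for name in os.listdir(self.root):
            with open(os.path.join(self.root, name)) as f:
                runs.append(json.load(f))
        return max(runs, key=lambda r: r["metrics"][metric])

store = TrackingStore(tempfile.mkdtemp())
store.log_run({"lr": 0.1, "depth": 3}, {"accuracy": 0.81})
store.log_run({"lr": 0.01, "depth": 5}, {"accuracy": 0.88})
best = store.best_run("accuracy")
print(best["params"])  # → {'lr': 0.01, 'depth': 5}
```

A real tracking server adds the pieces this sketch omits: a UI for comparing runs, artifact storage for the model files themselves, and multi-user access.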
MLFlow was announced last June, and it already has about 50 contributors from a number of organizations also using it in production. Zaharia said they are making good progress with MLFlow, and at this point, the goal is to get lots of feedback and improve MLFlow until they are happy with it.
Besides being able to deploy ML models on Spark and Delta, MLFlow can also export them as REST services to be run on any platform, or on Kubernetes via Docker containerization. Cloud environments are also supported, currently AWS SageMaker and Azure ML, leveraging advanced capabilities such as A/B testing offered by those platforms.
Zaharia noted that the goal is to make sure models can be packaged into applications -- for example, mobile applications. There are different ways to do this, he added, such as exporting the model as a Java class, but no standard way, and this is a gap MLFlow aims to address.
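One way to package a model portably is to reduce it to a predict function plus the list of libraries it needs, which is roughly the interface MLFlow exposes. The sketch below illustrates the idea using only the standard library; the class name and fields are hypothetical, not MLFlow's actual packaging format.

```python
# Illustrative "model as a function plus dependencies" packaging interface.
# How the model stores its bits internally is its own business; the runtime
# only needs the callable and the install list.

class PackagedModel:
    """A model reduced to: a predict function plus the libraries it needs."""

    def __init__(self, predict, requirements):
        self.predict = predict            # any callable: inputs -> predictions
        self.requirements = requirements  # what to install before calling it

# A trivial linear model packaged this way:
linear = PackagedModel(
    predict=lambda xs: [2 * x + 1 for x in xs],
    requirements=["numpy>=1.15"],  # hypothetical dependency for illustration
)

print(linear.predict([0, 1, 2]))  # → [1, 3, 5]
print(linear.requirements)
```

The appeal of this framing is that a serving layer can deploy any model the same way -- install the requirements, call the function -- regardless of which framework trained it.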
The future of machine learning is distributed
If you are familiar with ML model deployment, you may know about PMML and PFA, existing standards for packaging ML models for deployment. Discussing how MLFlow differs from these led to the other initiative Databricks is working on: Project Hydrogen.
Project Hydrogen's goal is to unify state-of-the-art AI and big data in Apache Spark. What this means in practice is unifying data and execution: offering a way for different ML frameworks to exchange data, and standardizing the training and inference process.
For the data part, Project Hydrogen builds on Apache Arrow. Apache Arrow is a common effort to represent big data in memory for maximum performance and interoperability. Zaharia noted that it already supports some data types, and can be expanded to more: "We can do better."
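Arrow's core idea is a standardized columnar in-memory layout that different frameworks can share without converting data record by record. A standard-library-only illustration of row-oriented versus column-oriented representation (not Arrow's actual format) shows the difference:

```python
from array import array  # typed, contiguous buffers, loosely analogous to Arrow's

# Row-oriented: each record is an object; scanning one field touches every record.
rows = [
    {"user": "a", "score": 1.0},
    {"user": "b", "score": 2.5},
    {"user": "c", "score": 4.0},
]
row_scan = sum(r["score"] for r in rows)

# Column-oriented: one contiguous typed buffer per field. Another framework
# could consume the "score" buffer directly, with no per-record conversion --
# the interoperability win that a shared columnar standard provides.
columns = {
    "user": ["a", "b", "c"],
    "score": array("d", [1.0, 2.5, 4.0]),
}
col_scan = sum(columns["score"])

assert row_scan == col_scan == 7.5
```

In the real Arrow format the buffers follow one agreed layout, so handing data from, say, a Spark job to a deep learning framework becomes a pointer handoff rather than a serialization step.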
So, why not reuse PMML/PFA for the execution part? Two words, according to Zaharia: distributed training. Zaharia noted that while PMML/PFA are geared toward packaging models for deployment, and there is some integration with these, both have limitations. In fact, he added, there is no standard model serialization format that really cuts it right now:
"ONNX is a new one. People also talk about Tensorflow graphs, but none of them covers everything. Tensorflow graphs does not cover things like random forest. PMML does not cover deep learning very well.
In MLFlow, we view these via a more basic interface, like 'my model is a function with some libraries I need to install.' So, we don't care about how the model chooses to store its bits, but about what we need to install.
We can support distributed training via something like MPI. This is a very standard way to build High Performance Computing (HPC) jobs. It's been around for 20 years, and it works!"
This author can testify to both claims, as MPI was what we used to do HPC research exactly 20 years ago. Zaharia went on to add that where possible they would like to reuse existing community contributions, citing for example Horovod, an open-source framework for distributed ML built by Uber.
Zaharia noted that Horovod is a more efficient way to communicate in distributed deep learning using MPI, and it works with Tensorflow and PyTorch: "To use this, you need to run an MPI job and feed it data, and you need to think how to partition the data."
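The core operation in this style of distributed deep learning is an allreduce: each worker computes gradients on its own partition of the data, the gradients are averaged across all workers, and every worker applies the same averaged update. The standard-library sketch below simulates that averaging step; real code would run as an MPI job across processes (e.g. via mpi4py or Horovod), not over local lists.

```python
# Simulated data-parallel gradient averaging -- the reduction that an
# MPI/Horovod-style allreduce performs across workers. Here the "workers"
# are just rows of a local list, for illustration only.

def allreduce_average(worker_grads):
    """Average per-parameter gradients computed by each worker on its shard."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [
        sum(g[i] for g in worker_grads) / n_workers
        for i in range(n_params)
    ]

# Gradients each worker computed on its own data partition:
grads = [
    [0.2, -1.0, 0.5],   # worker 0
    [0.4, -0.6, 0.1],   # worker 1
    [0.0, -0.8, 0.3],   # worker 2
]
avg = allreduce_average(grads)
print(avg)  # every worker then applies this same averaged update
```

Horovod's contribution is doing this reduction efficiently -- in a ring, overlapping communication with computation -- rather than naively gathering everything to one node.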
The part where Zaharia mentioned exporting ML models as Java classes was a good opportunity to discuss programming language support and adoption patterns on Spark. Overall, Zaharia's observations are in line with the sentiment in the community:
"I think we mostly see Python, R, and Java in data science and machine learning projects, and then there is a drop-off.
In MLFlow, we started with just Python, and added Java, Scala, and R. Usage varies by use case, which is why we try to support as many as possible. The most common, especially for new ML projects, tends to be Python, but there are many domains where R has amazing libraries and people use it. In other domains, especially for large-scale deployments, people use Java or Scala."
This was also a good opportunity to discuss Apache Beam. Beam is a project that aims to abstract stream processing via a platform-agnostic API, so that it can be portable. Beam has recently added a mechanism to support programming in languages other than its native Java, and it is what Apache Flink, a key competitor to Spark, is using to add Python support.
Last time we talked, Databricks was not interested in dedicating resources to support Beam, so we wondered whether the possibility of adding support for more programming languages via Beam could change that. Not really, as it turns out.
Zaharia maintained that the best way to do streaming on Spark is to use Spark structured streaming directly, although third-party integration with Beam exists. But he did acknowledge that the option of supporting many different languages via Beam is interesting.
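Structured streaming's model treats a stream as an unbounded table and maintains a result that is updated incrementally as each micro-batch arrives, rather than recomputed from scratch. The standard-library sketch below mimics that incremental model for a running word count; it is a conceptual illustration, not Spark's actual API.

```python
# Conceptual sketch of structured streaming's execution model: a running
# aggregation updated incrementally per micro-batch of input.

from collections import Counter

class RunningWordCount:
    def __init__(self):
        self.counts = Counter()  # the continuously updated "result table"

    def process_batch(self, lines):
        # Only the new micro-batch is processed; prior state is carried over.
        for line in lines:
            self.counts.update(line.split())
        return dict(self.counts)  # snapshot after this micro-batch

query = RunningWordCount()
query.process_batch(["spark streaming", "spark sql"])
snapshot = query.process_batch(["streaming works"])
print(snapshot)  # counts reflect all batches seen so far
```

Spark adds what this toy omits: fault-tolerant state checkpointing, exactly-once output semantics, and the full SQL engine over the unbounded table.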
He also added, however, that as opposed to Spark, where additional language support was added after the fact, MLFlow's REST support enables people to build a package using, for example, Julia now if they so wish.
Zaharia also commented on the introduction of ACID by Apache Flink, and what this means for Spark, especially in view of data Artisans' pending patent. Zaharia was puzzled as to what exactly could be patented. He noted that streaming that worked with Postgres, for example, has been around since the early 2000s, and exactly-once semantics have been supported by Spark streaming since its initial release:
"When Spark talks about exactly once, that is transactional. Delta also supports transactions with a variety of systems, like Hive or HDFS. Perhaps the patent covers a specific distribution pattern or storage format. But in any case transactions are important, this matters in production."
As for Databricks' cloud-only strategy, Zaharia noted it's working out quite well. Sometimes, it's Spark users migrating to the Databricks platform. Other times, it's line-of-business requirements that dictate a cloud-first approach. In any case, it seems Spark has established a strong foothold in a relatively short time, and with Spark continuing to innovate, there are no signs of slowing down on the horizon.