Spark Summit 2018 Preview: Putting AI up front, and giving R and Python programmers more respect

As Spark gets ready for its second act, Databricks is aiming to expand the footprint of its cloud service for handling machine learning and deep learning workloads. And it's taking steps to improve performance for R and Python programmers.

It shouldn't be surprising given the media spotlight on artificial intelligence, but AI will be all over the keynote and session schedule for this year's Spark + AI Summit.

The irony, of course, is that while Spark has become known as a workhorse for data engineering workloads, its original claim to fame was that it put machine learning on the same engine as SQL, streaming, and graph. But Spark has also had its share of impedance mismatch issues, such as treating R and Python programs as less than first-class citizens, or struggling to adapt to the more compute-intensive processing that AI models require. Of course, that hasn't stopped adventurous souls from breaking new ground.

Hold those thoughts for a moment.

Databricks, the company whose founders created the Apache Spark project, has sought to ride Spark's original claim to fame as a unified compute engine by billing itself as a unified analytics platform. Over the past year, Databricks has addressed some gaps -- such as Delta, which added the long-missing persistence layer to its cloud analytics service -- and expanded its reach with Azure Databricks.

This week at Spark + AI Summit, Databricks is announcing that Delta will hit general release later this month. The guiding notion is that Delta improves reliability by providing the landing zone for data pipelines. That makes it a more scalable option for staging and manipulating data than DataFrame or RDD constructs, which were never meant for anything except marshalling data for processing.

Delta is not a data warehouse, in that it stores data as columnar Parquet files rather than database tables. So it's an offering where the primary audience is data scientists who work in schema-on-read mode, rather than, say, business users working with curated Tableau extracts -- although there's nothing stopping data scientists from generating those extracts for populating dashboards or alerts.

With Delta, Databricks addresses the data bottlenecks of its Spark compute cloud service. As to the bigger issues in the room, Databricks detailed the longer-term vision for the Spark project and for its commercial platform.

For instance, because Apache Spark was written in Scala and optimized for running Scala or Java programs, R and Python developers were often left out in the cold. In some cases, the problem is that APIs for Python have not always kept pace with those for Java or Scala. In others, it is compatibility: while Spark's DataFrames were patterned after those developed for R, the two are not compatible. In many cases, getting libraries from the R or Python communities to run efficiently on Spark required significant tuning or workarounds.

Databricks is creating runtimes for machine learning and deep learning frameworks and tools such as scikit-learn, Anaconda, MXNet, CNTK, and Horovod that will encapsulate all of the dependencies. Additionally, Databricks is creating runtimes that support GPUs on AWS and Azure. For Python developers, the Apache Spark project will introduce UDFs (user-defined functions) that run popular Python libraries such as pandas, which should accelerate performance.
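
To see why handing a UDF a whole batch of values can be faster than calling it once per row, here is a toy sketch in plain Python. It does not use Spark or pandas, and every name in it (`CallCounter`, `apply_row_at_a_time`, `apply_in_batches`, the temperature functions) is hypothetical. It only counts function invocations, as a stand-in for the per-call overhead a real engine would pay.

```python
from typing import Callable, List, Sequence


class CallCounter:
    """Wraps a function and counts how many times it is invoked."""

    def __init__(self, fn: Callable):
        self.fn = fn
        self.calls = 0

    def __call__(self, arg):
        self.calls += 1
        return self.fn(arg)


def apply_row_at_a_time(values: Sequence[float], row_fn: Callable) -> List[float]:
    # One function call per row.
    return [row_fn(v) for v in values]


def apply_in_batches(values: Sequence[float], batch_fn: Callable, batch_size: int) -> List[float]:
    # One function call per batch; the function receives a slice of the column.
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    out: List[float] = []
    for start in range(0, len(values), batch_size):
        out.extend(batch_fn(values[start:start + batch_size]))
    return out


def to_fahrenheit(celsius: float) -> float:
    # Hypothetical row-level function.
    return celsius * 9 / 5 + 32


def to_fahrenheit_batch(celsius_values: Sequence[float]) -> List[float]:
    # Hypothetical batch-level version of the same function.
    return [c * 9 / 5 + 32 for c in celsius_values]


if __name__ == "__main__":
    readings = [float(c) for c in range(0, 100, 10)]  # 10 rows
    row_fn = CallCounter(to_fahrenheit)
    batch_fn = CallCounter(to_fahrenheit_batch)
    apply_row_at_a_time(readings, row_fn)
    apply_in_batches(readings, batch_fn, batch_size=4)
    print("row-at-a-time calls:", row_fn.calls)
    print("batched calls:", batch_fn.calls)
```

Both paths return the same values. The batched path simply crosses the function boundary far fewer times, which is the intuition behind batch-oriented UDFs.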

With AI in the spotlight, it shouldn't be surprising that several product enhancements to the Databricks platform, and initiatives for the Apache Spark project, are being unveiled. MLflow is a new open source framework that Databricks is introducing for managing the machine learning lifecycle. It will work across Databricks and other cloud PaaS offerings to package code, execute and compare hundreds of parallel experiments, and manage related steps in the lifecycle from data prep to monitoring the runtimes. Although not part of the Spark project per se, MLflow will integrate with Spark, scikit-learn, TensorFlow, and other open source machine learning frameworks. It echoes similar capabilities provided by rival offerings like IBM Watson Studio and Cloudera Data Science Workbench.
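
As a rough illustration of what "comparing experiments" involves, the sketch below records the parameters and metrics of each training run and picks the best one. It is plain Python and deliberately does not use MLflow's API. `Run`, `ExperimentLog`, and the parameter and metric names are all hypothetical, and the metric values are made-up placeholders, not results.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One training run: the settings used and the scores it produced."""
    params: Dict[str, float]
    metrics: Dict[str, float] = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory log of runs; a real tool would persist these."""

    def __init__(self) -> None:
        self.runs: List[Run] = []

    def record(self, params: Dict[str, float], metrics: Dict[str, float]) -> Run:
        # Copy the dicts so later changes by the caller do not alter the log.
        run = Run(params=dict(params), metrics=dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric: str, higher_is_better: bool = True) -> Run:
        # Only consider runs that actually reported the requested metric.
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run recorded metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


if __name__ == "__main__":
    log = ExperimentLog()
    # Placeholder numbers purely for illustration.
    log.record({"learning_rate": 0.1}, {"accuracy": 0.81})
    log.record({"learning_rate": 0.01}, {"accuracy": 0.88})
    log.record({"learning_rate": 0.001}, {"accuracy": 0.84})
    print(log.best("accuracy").params)
```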

In turn, a new Apache Spark initiative, Project Hydrogen (a name change from its former Oxygen code name), is being announced to address Spark's disconnect with deep learning. The crux of the matter is the gulf between I/O-intensive Spark jobs and the compute-intensive nature of deep learning runs, especially those involving multi-layered neural networks. Project Hydrogen will have three separate components, each addressing a specific challenge.

The first will target the handoff from ETL data pipelines to the deep learning model, by devising a more efficient means for marshalling data. The second will adjust Spark task scheduling to be more compatible with the Message Passing Interface (MPI) used in massively parallel supercomputing (the type of compute associated with DL jobs). The last element of Project Hydrogen will be to make Spark GPU- and FPGA-aware down to the task level. The significance of the task level is that it is much more granular than the job level, so if only part of a deep learning problem needs special hardware, the rest of the job could be routed to more economical CPUs. There's as yet no target date for Hydrogen delivery, but the first bits will probably start coming out later this year.
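
The payoff of task-level hardware awareness can be sketched in a few lines of plain Python. This is not Spark's scheduler or any Project Hydrogen API. `Task`, `needs_accelerator`, and `route_tasks` are hypothetical names, used only to show one job being split so that just the tasks that need special hardware are sent to it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Task:
    """One unit of work within a larger job."""
    name: str
    needs_accelerator: bool  # True if the task requires a GPU or FPGA


def route_tasks(tasks: List[Task]) -> Dict[str, List[str]]:
    # Decide per task, not per job: only accelerator-bound tasks go to the
    # expensive pool, and everything else runs on ordinary CPUs.
    pools: Dict[str, List[str]] = {"accelerator": [], "cpu": []}
    for task in tasks:
        pool = "accelerator" if task.needs_accelerator else "cpu"
        pools[pool].append(task.name)
    return pools


if __name__ == "__main__":
    job = [
        Task("load_and_clean", needs_accelerator=False),
        Task("train_network", needs_accelerator=True),
        Task("write_report", needs_accelerator=False),
    ]
    print(route_tasks(job))
```

If the routing decision were made per job instead, all three tasks in this example would occupy accelerator hardware even though only one of them needs it.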

The need for Project Hydrogen reflects the fact that deep learning was hardly on the horizon when Spark emerged roughly five years ago. It is a key indicator that when it comes to big data analytics, we're far from the point where we can close the Patent Office.