Paraphrasing Garrison Keillor, it's been a quiet week in the Apache Spark community - at least compared to last year, when the definitive Spark 2.0 was unveiled. Last week, Spark Summit pulled into Boston, and so did one of those nor'easters that make Boston so alluring in February.
And so the Spark project, for now, is engaged in the blocking-and-tackling chores of cleaning up and optimizing APIs. For instance, a recent update to Spark 2.0 added pipeline processing to run complex machine learning jobs more efficiently.
And of course, while we're on the topic of machine learning, it was virtually impossible to avoid presentations covering it. This is 2017, after all. Almost every customer in the enterprise case study track spoke of using machine learning in their solutions, or of adding it as a next step. More often than not, ML was paired with streaming, SQL queries, and graphs built to identify critical interrelationships.
Netflix described its use of Spark ML, the emerging set of machine learning libraries that represents Spark's future direction, as the core of its personalized recommendation engine. GoDaddy's small business success index, which provides health scores for the effectiveness of its customers' websites, incorporates natural language processing to parse interactions, predictive statistical models to identify topics, and machine learning to identify which content is most effective.
For others, machine learning is the next step. Capital One discussed the success of its Second Look app, which provides timely alerts to credit card customers about unexpected and potentially mistaken charges. It uses Spark in conjunction with Kafka as an efficient queuing system for processing incoming streams and screening for suspicious charges, and it is looking at adding machine learning as a means of personalizing those alerts in the future.
With Spark 2.0 in general availability for less than a year, a key theme of the summit was showing attendees how to take advantage of new features such as Structured Streaming, which lets you run the same SQL calls against data in motion and data at rest: you can aim the same query at data flowing in through Spark Streaming and at data sitting in columnar stores such as Parquet.
Spark is one of a growing number of paths for collapsing the Lambda Architecture, which specifies separate batch and real-time processing tiers. That's not surprising; the compute engine or data platform that can readily accommodate both needs is a good candidate to become your gateway to big data compute. Others, such as Google with Apache Beam, and MapR and Hortonworks with their respective data flow management engines, are making such bids.
Much of the anticipation this year was over the plans for UC Berkeley's successor to AMPLab, the research center that gave rise to Spark, plus projects such as Mesos and Alluxio (formerly Tachyon). AMPLab's mission targeted advanced analytics through batch processing. RISELab, the successor, picks up where AMPLab left off, focusing on secure real-time processing.
And one of the first projects on RISELab's agenda is particularly pertinent for Spark: a pure streaming engine that's faster than Spark Streaming (which technically performs microbatching, not true streaming). The new project, Drizzle, was actually unveiled at Spark Summit West last summer. Early benchmarks show it processing events 10x faster than Spark Streaming at tens of millions of events per second, but Flink's superior performance at the extreme end of the scale (around the 20-million-event mark) shows there's still plenty of work to be done.
One of the findings of Databricks' recent Spark survey is the high level of community activity. Now that the conference is over, somebody else wants the last word.
Today, a team from Yahoo announced its contribution to the Spark community: the open sourcing of TensorFlowOnSpark. As the unwieldy project name implies, it's about making TensorFlow, the deep learning library open sourced by Google in late 2015, run on Spark.
There was plenty of excitement last summer when Databricks and Google collaborated on TensorFrames, which provided a way for TensorFlow to execute via Spark's DataFrames. But according to Yahoo data scientist Andy Feng, the result was a compromise, as performance couldn't equal running TensorFlow natively on the Google Cloud Platform. Yahoo's package, TensorFlowOnSpark, has a smaller API and allows TensorFlow operations to execute asynchronously, without having to go through the bottleneck of the Spark driver. Oh, and if your cluster has a high-bandwidth InfiniBand network, TensorFlowOnSpark can optimize memory management for it.
While Spark 2.0 has made great strides in giving developers a stable target (they now know how the APIs are organized), there are still many blanks left to fill. There's little doubt, for instance, that many in the R and Python developer communities would like optimizations that surpass what's possible with the DataFrame API today. Putting R and Python programming on a level playing field with Scala will merit its own post.
Either way, while this year's Spark Summit was relatively short on news, it's hardly a sign that it's time to close down the patent office.