Spark 2.0 aims to make streaming a first-class citizen

With the 2.0 release, the open-source Apache Spark compute engine has entered adolescence. It has consolidated several APIs in the name of simplification, while adding a few for promoting extensibility and improved performance. And by eroding the wall between real-time and batch, it could push streaming into the core of analytics applications.

Spark has generated huge excitement in the big data community. It has provided a path to making the speed (or velocity) layer of big data a reality, smoothed over some of MapReduce's gotchas, and expanded the palette of computations beyond MapReduce, all through a set of common interfaces. But there remain issues with memory management, coordination with YARN, and setting a level playing field for R and Python developers. With the new release, Spark flattens the Lambda architecture, adds some tweaks for using R, and takes performance up a notch. Yes, Spark is entering puberty.

Spark 2.0, announced last week following a couple of months of tech previews, generated few surprises. The highlight was a new Structured Streaming API that brings interactive SQL queries to the real-time Spark Streaming library. Structured Streaming also binds Spark Streaming more closely to the rest of Spark by integrating with the soon-to-be-superseded MLlib.

In so doing, Structured Streaming could break down the barrier between real-time and batch processing. This is what Databricks terms continuous applications, where real-time processing is used to continually update tables that are being queried, or to add real-time extensions to batch jobs. In effect, with a single API, Structured Streaming flattens the Lambda architecture, which was originally built on the premise that batch and real-time processing had to be kept separate. Spark's original contribution to Lambda was that you could use the same APIs for the batch and "speed" (real-time) compute "layers"; with Structured Streaming, you can unite them in the same process. This has been done before, but it previously required proprietary complex event processing (CEP) engines to pull off.
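The idea of flattening Lambda can be sketched without Spark at all. The pure-Python analogy below (not Spark's actual API; the `word_counts` function and the data are invented for illustration) shows the core point: one transformation, written once, serving both a batch view over data at rest and a continuously updated, queryable table fed by arriving events.

```python
from collections import Counter

def word_counts(records):
    """One transformation, written once: count words in an iterable of lines."""
    counts = Counter()
    for line in records:
        counts.update(line.split())
    return counts

# Batch "layer": the full historical dataset at rest.
history = ["spark streams", "spark batches"]
batch_view = word_counts(history)

# Speed "layer": the same logic applied incrementally as events arrive,
# continually updating a queryable table -- no separate codebase.
live_table = Counter()
for event in ["spark streams"]:
    live_table.update(word_counts([event]))

print(batch_view["spark"])  # 2
print(live_table["spark"])  # 1
```

In a classic Lambda deployment, the two layers would be separate systems with separately maintained logic; the promise of Structured Streaming is that the single-function version above is how the real thing is written too.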

Although not part of the 2.0 release, we believe that Structured Streaming is also the first step toward adding true streaming (processing a single event at a time) to Spark Streaming, which currently only performs microbatching. As a higher-level API, it abstracts the logic from the underlying processing engine, which provides the flexibility to change or alter the architecture of that engine. Stay tuned.
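The difference between microbatching and true streaming is easy to show in miniature. This is a simplified, pure-Python sketch (Spark Streaming actually batches by time interval, not by count; the function name and sizes here are invented for illustration):

```python
def microbatches(events, size):
    """Group an event stream into fixed-size batches, microbatch-style."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, partial batch
        yield batch

stream = iter(range(7))
batches = list(microbatches(stream, 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Under microbatching, an event's latency is bounded below by the batch interval; a true streaming engine would instead hand each event to the operator as it arrives, which is what the decoupled Structured Streaming API would allow the engine to do someday without changing user code.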

While Structured Streaming adds a new API, the convergence of DataFrames and Datasets will reduce API clutter in 2.0. DataFrames, patterned on similar constructs familiar to R and Python programmers, allow data to be manipulated like database tables. In turn, Datasets extended that same API to data represented as typed Java objects. Because both were built on the same underlying engine, it was logical with Spark 2.0 to merge them, providing a single API with a choice of tabular or object faces. While Spark's original RDD constructs still hold a performance edge over Datasets in some workloads, there is little question that the Spark project will continue tweaking Datasets to narrow the gap.
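The "tabular or object faces" idea can be illustrated outside Spark. In this pure-Python sketch (not Spark's API; the `Person` type and the rows are invented for illustration), the same records are queried once as untyped rows, DataFrame-style, and once as typed objects, Dataset-style:

```python
from dataclasses import dataclass

rows = [{"name": "ada", "age": 36}, {"name": "grace", "age": 45}]

# "DataFrame" face: untyped rows addressed by column name, like a table.
adults = [r for r in rows if r["age"] > 40]

# "Dataset" face: the same data as typed objects, with field access
# the compiler (or a linter) can check, instead of string column lookups.
@dataclass
class Person:
    name: str
    age: int

people = [Person(**r) for r in rows]
older = [p for p in people if p.age > 40]

print([r["name"] for r in adults])  # ['grace']
print([p.name for p in older])      # ['grace']
```

In Spark 2.0's merged API the two faces are views over one structure, so the engine can optimize both the same way; the trade-off is purely about how the programmer wants errors caught, by column name at runtime or by field name earlier.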

And with the ascendance of Datasets comes a simplified way of building machine-learning (ML) programs: Spark.ML. The new library provides templates for building and representing ML programs as multi-step pipelines, so developers don't have to manually code them. And these ML pipelines can be saved and loaded for reuse. Over time, Spark.ML will supersede MLlib as Spark's primary ML API. For this release, Spark has added support for running user-defined functions and distributed algorithms (e.g., generalized linear models) in R.
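A pipeline in this sense is just an ordered chain of named stages that can be run end to end and persisted as a unit. Here is a minimal pure-Python sketch of the pattern (not Spark.ML's API; the `Pipeline` class and its stages are invented for illustration, and persistence is shown with `pickle` rather than Spark's save/load):

```python
import pickle

class Pipeline:
    """A minimal ML-style pipeline: an ordered list of named stages."""
    def __init__(self, stages):
        self.stages = stages  # list of (name, callable) pairs

    def run(self, data):
        # Feed each stage's output into the next stage.
        for _, stage in self.stages:
            data = stage(data)
        return data

def tokenize(docs):
    return [d.split() for d in docs]

def count_tokens(token_lists):
    return [len(t) for t in token_lists]

pipe = Pipeline([("tokenize", tokenize), ("count", count_tokens)])
print(pipe.run(["a b c", "d e"]))  # [3, 2]

# Pipelines can be persisted and reloaded for reuse, which is the
# workflow Spark.ML's pipeline save/load enables.
restored = pickle.loads(pickle.dumps(pipe))
print(restored.run(["x y"]))  # [2]
```

The payoff of the pattern is that the whole multi-step program becomes a single artifact: one object to test, version, share, and reload, instead of hand-wired glue code between steps.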

Because Spark is best known for its suitability for executing machine-learning programs, it shouldn't be surprising that ML has become the hotspot for competition. For instance, R and Python have their own machine-learning packages that work natively with those languages. And so do the folks at H2O, who claim to have more performant algorithms. Then there's Google TensorFlow, which, for now, is focused on deep-learning problems; and yep, there's a way to run it on Spark.

Spark 2.0 also provides a preview of emerging technologies. They include Tungsten, a project that works around the 20-year-old JVM with explicit memory management and runtime code generation to boost performance. The guiding notion is that while Spark's data artifacts -- RDDs and the converged Datasets -- accelerated data access, the JVM (with its overhead for garbage collection) bottlenecked the compute. In Spark 2.0, Tungsten is supported for caching and runtime execution.

So, Spark 2.0 tweaks some existing features and, with others, opens new directions that depart from its origins. Aside from the expanded Datasets, which build on pre-existing APIs, most of the new features, such as Structured Streaming and Spark.ML, are still alpha-stage technologies -- so we expect a shakeout before they start entering production in about six to 12 months. There's growing take-up of the Spark engine, yet competition for the onramps to it, as Microsoft, Continuum Analytics, and H2O offer their own onramps for R and Python that preserve the option of dropping in new engines that maybe, just maybe, might be sexier someday.
