Apache Beam and Spark: New coopetition for squashing the Lambda Architecture?

While Google has its own agenda with Apache Beam, could it provide the elusive common on-ramp to streaming?
Written by Tony Baer (dbInsight), Contributor

The nice thing about open source projects and standards is that there are so many of them to choose from. And on January 10, the Apache community welcomed Beam as its latest "top level" project (getting top level means your project has made it to prime time in Apache).

Beam is the latest manifestation of Google's newfound open technology strategy. Google traditionally kept its technology to itself, typically publishing research papers that the open source community would then reinvent under clean room conditions. That's how HDFS, Hadoop's foundational file system, and MapReduce got started.

But of late, Google is sharing its goodies, a development that more than coincidentally parallels its growing seriousness in competing with Amazon and Microsoft in the cloud computing business. As a challenger to Amazon, Google can't rely on proprietary technologies alone -- it needs to make some bets that will go viral with developers, with open source providing the likeliest on-ramp.

While Android has long been Google's best known open source project, TensorFlow and Kubernetes are more instrumental to drawing customers to the Google cloud. Actually, Google makes that point verbatim in its "Why Apache Beam" blog post.

Beam is an API that separates the building of a data processing pipeline from the actual engine on which it would run. It includes abstractions for specifying the data pipeline, the actual data stream (akin to Spark's RDDs), transformation functions, "runners" (the compute engine), and the sources and targets. It's one of a growing number of approaches for flattening the Lambda architecture, so you can combine real time and batch processing (and interactive as well) on the same code base and cluster.
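To make the separation of pipeline from engine concrete, here is a minimal sketch in plain Python. It is a toy, not the Apache Beam SDK: the names `Pipeline`, `map_step`, `filter_step`, `BatchRunner`, and `StreamingRunner` are hypothetical and exist only for illustration. The pipeline is defined once, then handed to two different "runners," one that processes the whole input step by step and one that pushes a single element through at a time.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List

# Hypothetical toy classes for illustration; this is NOT the Apache Beam SDK.


@dataclass
class Pipeline:
    """An ordered list of element-wise steps, with no execution logic."""
    steps: List[Callable[[Iterable], Iterator]] = field(default_factory=list)

    def apply(self, step):
        self.steps.append(step)
        return self  # return self so calls can be chained


def map_step(fn):
    """Build a step that applies fn to every element."""
    return lambda elements: (fn(e) for e in elements)


def filter_step(pred):
    """Build a step that keeps only elements where pred is truthy."""
    return lambda elements: (e for e in elements if pred(e))


class BatchRunner:
    """Runs the whole bounded input through each step in turn."""

    def run(self, pipeline, source):
        data = list(source)
        for step in pipeline.steps:
            data = list(step(data))
        return data


class StreamingRunner:
    """Pushes one element at a time through every step."""

    def run(self, pipeline, source):
        out = []
        for element in source:
            stage = [element]
            for step in pipeline.steps:
                stage = list(step(stage))
            out.extend(stage)
        return out


# The pipeline is written once, with no reference to any engine.
pipeline = (
    Pipeline()
    .apply(map_step(str.strip))
    .apply(filter_step(bool))      # drop empty strings
    .apply(map_step(str.upper))
)

events = [" click ", "", "view", "  buy"]
batch_result = BatchRunner().run(pipeline, events)
stream_result = StreamingRunner().run(pipeline, iter(events))
print(batch_result)
print(stream_result)
```

Both runners produce the same output from the same pipeline definition, which is the promise in miniature. Real engines differ in far more than iteration order, so the hard part in practice is the cross-engine compatibility, not the abstraction itself.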

While it is still early days, most functions supported by Beam can be executed in Google Cloud Dataflow along with the open source Spark, Flink, and Apex engines. The operable notion, write your code once and then run it everywhere (or at least on the compute engine of choice), sounds awfully redolent of Java's original promise. Hopefully Google will be a bit more successful than Sun was this go-round.

Before breaking into song, keep in mind that just as Apache YARN was spun out of MapReduce, Beam extracts the SDK and dataflow model from Google's own Cloud Dataflow service. On closer inspection, support for the Beam programming model is roughly 70 percent complete across the Apache Apex, Flink, and Spark Streaming engines. The biggest gap is with functions dependent on true stream processing (the ability to process one event at a time and set discrete time windows), where Spark Streaming's microbatch capabilities either fall short or require workarounds.
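The difference between per-event windowing and microbatching is easier to see with a small example. The sketch below is plain Python with made-up data, not code for any of the engines named here; the 60-second window size and the function names are assumptions chosen for illustration. One function assigns each event to a window by when it happened, while the other loosely mimics a microbatch by grouping events according to when they arrived.

```python
from collections import Counter

WINDOW_SECONDS = 60  # assumed window size, for illustration only


def window_start(timestamp, size=WINDOW_SECONDS):
    """Return the start of the fixed window that contains timestamp."""
    return timestamp - (timestamp % size)


# (event_time_seconds, arrival_time_seconds, key); made-up sample data
events = [
    (5, 6, "a"),
    (50, 52, "b"),
    (58, 125, "c"),  # happened in the first minute, arrived in the third
    (70, 71, "d"),
]


def count_by_event_time(events):
    """Handle one event at a time, windowing by when it happened."""
    counts = Counter()
    for event_time, _arrival, _key in events:
        counts[window_start(event_time)] += 1
    return dict(counts)


def count_by_arrival_batch(events, batch_seconds=WINDOW_SECONDS):
    """Loosely mimic a microbatch: group by when events arrived."""
    counts = Counter()
    for _event_time, arrival, _key in events:
        counts[window_start(arrival, batch_seconds)] += 1
    return dict(counts)


print(count_by_event_time(events))
print(count_by_arrival_batch(events))
```

The late event "c" lands in the first window when grouped by event time, but in a later bucket when grouped by arrival. That mismatch is the kind of case where a microbatch approach needs workarounds. The sketch leaves out the machinery a real engine needs to decide when a window is complete.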

But as streaming technology is a moving target, Spark's Structured Streaming, part of Spark 2.0, will refactor Spark Streaming so that true streaming will soon be supported, making some of Google's points moot.

At the end of the day, this is all about which engine is going to become your frame of reference, or unifier. It's a battle as old as your company's software stack: who is your company's strategic IT supplier? Your database, enterprise application, or infrastructure provider? Now add to that, which is your strategic big data compute engine? What skills will you recruit and train your team on?

Each of these compute engines -- Google Cloud Dataflow, Spark, Flink, and Apex -- wants to be your one-stop shop. And that's where Beam becomes coopetition with Spark -- it will work with Spark, but theoretically, it will work with other engines too. And if successful, it could displace Spark from being your primary on-ramp to big data computing.

Spark has had the advantage of a head start -- there are hundreds of libraries, not to mention a fast growing skills base. But Spark doesn't do everything -- for instance, while it has SQL, engines such as Impala or HAWQ are still more efficient for random access and interactive query.

Has the Spark train already left the station? With Beam, Google wants to give you the option of saying, "not so fast." And of course, Google then rubs it in with benchmarks showing how its own Cloud Dataflow processing beats Spark in performance and coding efficiency (Google's comparison, published roughly a year ago, doesn't yet factor in Spark 2.0, or Structured Streaming in particular).

For developers, the question is whether they want to add yet one more layer of abstraction to their code. On one hand, there's the elusive promise of a common API to streaming engines that in theory should let you mix and match, or swap in and swap out. Google has designed Beam to be portable so you can move streaming workloads to and from Cloud Dataflow itself.

But, like any Switzerland-style API, assuring cross-compatibility is the big hurdle. The idea of abstracting logic from execution is hardly new -- it was the dream of SOA. And the recent emergence of microservices and containers shows that the dream still lives on.

But the notion of a common API to streaming engines could come in handy given that the market has not settled on any one engine as the default standard. If Beam is successful, it could provide developers a useful way to hedge their bets on streaming engines, but for now, that's a pretty tall order.
