Apache Flink: Does the world need another streaming engine?
While it has yet to draw critical mass commercial support, Apache Flink promises to fill a gap not addressed by other open source streaming engines: adding replay and rollback to your streaming application.
Ever since the days of complex event processing, we've been predicting that the future of streaming was bright. Nearly five years ago we stated, "The same data explosion that created the urgency for Big Data is also generating demand for making the data instantly actionable." And at the start of this year, we declared that "IoT is the use case that will push real time streaming onto the front burner."
And the hits just keep on coming. A month back, we took a look at DataTorrent, the company behind the Apache Apex project, which has applied a visual IDE to make streaming more accessible to application developers.
In the interest of hearing everyone out, let's take a look at Apache Flink. We had an interesting chat with data Artisans CEO Kostas Tzoumas, the company behind Flink that offers a platform for developing and managing Flink-based applications.
Compared to open source streaming engines like Apex, Storm, or Heron, Flink does more than streaming. It is more like the reverse image of Apache Spark in that both put real-time and batch on the same engine, doing away with the need for a Lambda architecture. Both have their APIs for querying data in tables, and both have APIs or libraries for batch and real-time processing, along with graph and machine learning.
So where does Flink depart from Spark? Flink is a compute engine built for streaming and extended for batch, while Spark was built for batching and its version of streaming: micro-batching (that will eventually change). To some extent, the same metaphor could be made for Storm (with its Trident API), but it lacks the libraries and support for critical features like exactly-once event processing (e.g., guarantees that an event will be processed only once) and has scaling limitations.
With such comparisons, it's tempting to state that when it comes to big data compute, the Spark train has left the station. While the backers of Flink claim it is one of the top five Apache Big Data projects, the Spark community has the committers and commit activity for being number one. Spark also has garnered commercial vendor support; virtually all Hadoop distributions support it as do all the major cloud providers. And a growing roster of analytics and data preparation tools is baking in Spark under the hood.
So why are we having this conversation? With Spark having first mover advantage to the market, the Flink folks are deliberately not trying to be all things to all people. They are not targeting interactive analytics or building of complex machine learning models. And, although Flink is built on a streaming engine, streaming analytics are not necessarily topping the list.
Instead, Flink is targeting stateful applications, because unlike the other open source streaming engines, it supports time stamping of events. That means applications built on streaming data pipelines can be rolled back or replayed.
In so doing, Flink is targeting a capability normally reserved for databases: maintaining stateful applications. Applications, implementing on Flink as microservices, would manage the state. For real-time applications, avoiding the I/O of the database, reduces latency and overhead. This is not such a new idea; with n-tier applications, state was often managed by the Java middleware layer. And if you dig back even deeper in time, transaction monitors (the forerunner of application servers) abstracted that capability from the database.
Still, in an era where fast databases built on in-memory or all-Flash storage have become practical, there remain a number of familiar real-time use cases where moving state maintenance out of the database might still make a difference. A sampling could include managing IoT networks, keeping current with e-commerce clickstreams, real-time recommendation engines, travel reservations, and connected cars. This doesn't mean that Flink will replace databases, as the data (or an extract of it) may eventually get persisted; the difference is that the a key component of processing some types of real-time transactions (not the ones that require full ACID support) is performed before the data gets to the database.
While commercial support for Flink is still a work in progress, it has drawn the beginnings of grassroots support with over 12,000 meetup members worldwide. MapR and data Artisans have co-authored an excellent detailed introduction to Apache Flink that can be freely downloaded,. And even if we thought that Spark settled the issue on merging real-time and batch, the Apache Beam project, implemented in Google'sCloud Dataflow service, is reopening the argument by providing a data in motion processing environment where you can swap in and out the real-time compute engine(s) of choice.