The past, present, and future of streaming: Flink, Spark, and the gang

Reactive, real-time applications require real-time, eventful data flows. This is the premise on which a number of streaming frameworks have proliferated. The latest milestone was adding ACID capabilities, so let us take stock of where we are in this journey down the stream -- or river.

Streaming is one of the top trends we've been keeping up with. The latest episode in that saga was adding ACID capabilities to Apache Flink, as covered by ZDNet's Tony Baer last week. This announcement, made at Flink Forward in Berlin, was the backdrop for in-depth conversations we had with executives, engineers, and users, which may help put things in context.

To begin with, as Baer noted, there is an API for Flink that can be downloaded from GitHub, but it only works for a single stream. The version with the "runner" for multiple parallel streams is part of the data Artisans Platform - the commercial incarnation of Flink.

This is not at all surprising, as data Artisans, the vendor that provides support for Flink and employs a large share of its full-time contributors, has an open core policy. That's a very common policy in the open source world, and one that data Artisans/Flink's main competitor, Databricks/Apache Spark, also follows.

How many streaming engines does the world need?

As Baer would say, how many streaming engines does the world need? Good question, and one that may be rephrased as two follow-up questions: How many vendors can survive doing what data Artisans and Databricks do, and how do you choose a streaming engine?

The answer to the first question is exactly two, at this point: data Artisans and Databricks. A third competitor, DataTorrent, with its Apache Apex engine, which we covered a while back, went belly up. It seems the unusual "we'll do anything including building on our competitor's engine" message was one last effort to stay afloat, by adopting an approach better suited to a consultancy than to a vendor behind an open source project.

Either way, this means there are a number of orphans in the open-source streaming solutions space now: Platforms without a vendor to provide support, a hardened version, and steer their development. Besides Apex, the list also includes Apache Storm and Apache Samza. Storm is older and more mature than Samza, and also has some support from Hortonworks.

Hortonworks' core business is not streaming, however, and if you want to use Storm and have enterprise support levels, it seems you'll have to go for the entire Hortonworks stack, too. We don't know whether Hortonworks has plans to step up for Storm, but we don't have any such signals at this point.

There also are a number of closed-source solutions for streaming, but it looks like they have an uphill battle to fight. They may have their merits and customer base to show for, but much of that is based on legacy contracts and relationships. In a "try before you buy," fast-paced, open-source world, and an expanding market for streaming, winning new contracts won't be easy.

And then we also have the cloud vendors, of course: AWS with Kinesis, Google Cloud with Dataflow, and Azure with Stream Analytics. The usual motif plays out here, as well. These engines may or may not be the ones best suited to your needs. But if you're already using AWS, Google Cloud, or Azure, they will make it really easy and tempting for you to sign up and integrate their streaming solution in your applications.

Streaming engines adoption and competition

When we discussed the streaming market with Kostas Tzoumas, data Artisans' CEO, he was clear about what he sees as the biggest competition for data Artisans: Legacy. Tzoumas deliberately refrained from comparing data Artisans/Flink to other options, focusing instead on the company's efforts to reach out and scale up in terms of evangelizing and sales.

His views resonated with many Flink Forward attendees, including some of data Artisans' most high-profile clients. Delegates with loads of technical hands-on experience from the likes of Alibaba, Netflix, and Microsoft all emphasized that changing the paradigm and learning to work with streaming is something they have to master, and spread the word about, every day.

Some of their comments were about the need to have streaming work with all the reliability that is a given in the batch world, to learn to program in a more thoughtful way compared to single-threaded applications, and to raise the abstraction level. data Artisans seems to be listening, judging from what is on its agenda.


The evolution of streaming. (Image: Data Artisans)

We already mentioned the introduction of ACID to cater for reliability, which was to a large extent driven by the requirements of large financial and eCommerce organizations that use the data Artisans Platform. Another major bet for Flink is the advance toward the unification of APIs for streaming and batch, which Alibaba has been working on and which is about to be integrated into the core Flink codebase.

Flink has a number of APIs -- data streams, data sets, process functions, the table API, and as of late, SQL, which developers can use for different aspects of their processing. Ideally, people would like to use SQL for everything. This would not only simplify the lives of developers, but also make Flink more approachable for non-technical users.

The need to make data Artisans sustainable may have something to do with other choices, too. The fact that data Artisans Platform is not available in the cloud, for example, is a striking difference from Databricks, which touts a cloud-only strategy for its own platform, playing the iPaaS card.

But when your main clients are behemoths with their own infrastructure, as seems to be the case for data Artisans, offering them a cloud version makes less sense. That may also explain Tzoumas' comment when he said that they do not compete with Databricks/Spark much. Not that Flink is not attractive for smaller organizations, but the story of using Flink plus some support and consulting, rather than the data Artisans Platform, was one we heard more often from them.

Data Artisans and Apache Flink going forward

Apache Flink's (twin) versions 1.4 and 1.5 were of the kind that introduces somewhat unglamorous, not very popular, but highly needed improvements. They were all about production deployment and stability options, and they meant some backwards compatibility had to be broken. This is why we heard that many users are still running 1.3, even though improvements in 1.6, mostly in streaming SQL, tempted some to take the plunge and upgrade.

Now, that hard, unglamorous work is mostly over. One important part that data Artisans aims to address is the containerization of Flink, or being able to use it as a library with Docker and Kubernetes, in what they call Reactive mode.

Other items on the agenda for the near future include auto-scaling, time-versioned table joins (a much needed feature in a world where data is constantly updated), and SQL for pattern analysis. SQL has been extended with the MATCH_RECOGNIZE capability toward this end, and data Artisans wants to bring this to Flink.
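To make the idea of a time-versioned join concrete, here is a minimal sketch in plain Python. It is not Flink code and does not reflect how Flink implements the feature; the rate table, the order tuples, and the function names are all hypothetical, chosen only to show that each event is matched against the version of a table that was valid at the event's own timestamp.

```python
from bisect import bisect_right

# Hypothetical versioned table: for each currency, a list of
# (effective_from_timestamp, rate) pairs sorted by timestamp.
RATE_HISTORY = {
    "EUR": [(0, 1.10), (10, 1.15), (20, 1.12)],
}


def rate_as_of(currency, ts):
    """Return the rate valid for `currency` at time `ts`, or None if none applies."""
    versions = RATE_HISTORY.get(currency, [])
    times = [t for t, _ in versions]
    # Index of the last version whose effective time is <= ts.
    idx = bisect_right(times, ts) - 1
    return versions[idx][1] if idx >= 0 else None


def join_orders_with_rates(orders):
    """Join each (ts, currency, amount) order with the rate valid at its timestamp."""
    result = []
    for ts, currency, amount in orders:
        rate = rate_as_of(currency, ts)
        if rate is not None:  # inner-join semantics: drop orders with no valid rate
            result.append((ts, currency, amount, round(amount * rate, 2)))
    return result


if __name__ == "__main__":
    orders = [(5, "EUR", 100), (10, "EUR", 100), (25, "EUR", 50)]
    for row in join_orders_with_rates(orders):
        print(row)
```

The point of the sketch is that an order placed at time 5 is converted with the rate in force at time 5, even if the rate table has been updated since, which is what distinguishes a time-versioned join from a join against the latest state only.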

Another interesting direction is opening up to Python via Apache Beam. Although Beam and Flink are conceptually rather close, as data Artisans CTO Stephan Ewen noted, up to now Flink did not have any tangible benefits to reap from being aligned with Beam. But support for Python is changing that.

Beam is introducing a framework through which APIs in languages other than Java can be supported, and Python is the first one. According to the Apache Beam people, this comes without unbearable compromises in execution speed compared to Java -- a slowdown of something like 10 percent in the scenarios they have been able to test.

This means that Flink can now be programmed in Python, too, via Beam, which is rather important given the prevalence of Python in data science and machine learning scenarios. Ewen acknowledged this, noting, however, that Flink is not about to give up Java anytime soon.

Databricks/Spark, on the other hand, has had support for Python for a while now, which may help explain what we perceive as a broad differentiation between the two platforms: Flink is used more as a fast, stateful processing engine, with ACID reinforcing its position as the integration hub for the real-time enterprise, while Spark is used more as a data science and analytics backbone, with Python and notebook integration contributing to its popularity.

Of course, there are overlaps, and things are not as clear-cut as that. In any case, it is worth noting that data Artisans' ACID support is patented and part of data Artisans Platform, which means that, unlike stateful streaming, Databricks will not be able to introduce it in its own platform as easily. Regardless, Databricks and Spark have been making progress on their own trajectory, and we will be sharing more on that soon.
