Going with the stream: Unbounded data processing with Apache Flink

Streaming is hot in big data, and Apache Flink is one of the key technologies in this space. What makes it different, what new features are included in its latest release, and what is its role in conquering the big data world?
Written by George Anadiotis, Contributor

Previously, we introduced streaming, saw some of the benefits it can bring and discussed some of the architectural options and vendors / engines that can support streaming-oriented solutions. We now focus on one of the key players in this space, Apache Flink, and the commercial entity that employs many of the Flink committers and provides Flink-related services, data Artisans (dA).

We talked with dA CEO and Flink PMC member, Kostas Tzoumas. Tzoumas, who has a solid engineering background and was one of the co-creators of Flink, was keen to elaborate on an array of topics: from the streaming paradigm itself and its significance for applications, to Flink's latest release and roadmap and dA's commercial offering and plans.

Is everything a stream?

Is everything a stream?

A few people, including Tzoumas, have made the case for seeing traditional, bounded data and processing as a special case of its unbounded counterparts. While this may seem like a theoretical construct, its implications can be far-reaching.

As Dean Wampler, author of Fast Data Architectures for Streaming Applications argues, "if everything is considered a "stream" -- either finite (as in batch processing) or unbounded -- then the same infrastructure doesn't just unify the batch and speed layers, but batch processing becomes a subset of stream processing."

This as a paradigm shift, argues Tzoumas, as it means that the database is no longer the keeper of the global truth.

The global truth is in the stream: an always-on, immutable flow of data that is processed by an unbounded processing engine. State becomes a view on that unbounded data, specific to each application and kept locally utilizing whatever storage makes sense for the application.

In this architecture, applications consume streams of unbounded data, but they also use streams to publish their own data that may in turn be consumed by other applications. So the streaming engine becomes the hub of the entire data ecosystem. According to Tzoumas:

"What most people think of when it comes to streaming is applications like real-time analytics or IoT. We do believe that these are super-important and we fully support them, however what unbounded processing has the potential to do is offer a new pathway for all applications whose nature fits the streaming model, and those go way beyond the typical examples one would think of. Basically, these are all applications that periodically update their data.

They may not necessarily be real time -- they may have latency that goes into the hours range, but that's not the point. as long as there is an inflow of data, we see them as streaming data applications. These are operational, not analytics applications. So for me streaming is not in any way confined to the analytics world, or the Hadoop world for that matter."

Tzoumas further says that:

"There is a use case in which the nature of data analysis makes it easier to work with bounded data, and that is data exploration -- when data scientists try to learn something from some static set of data and continuously explore different models and application code. In most cases, this exploration leads to more stable applications that are deployed and then continuously process an unbounded set of data; this is a streaming problem.

You can solve this streaming problem by modeling what is, in reality, an unbounded stream as periodically-arriving bounded data sets. Or the stream might arrive in higher-latency intervals (say every 1 hour). The larger point is that if you look closely at most batch jobs and how data comes in and out, you will see a pipeline around them that actually handles unbounded data. This is a streaming problem in nature, and is best solved by streaming technology."

Tzoumas offers two reasons for this:

1) Time management. You can manage time correctly (by using event time and watermarks), so you can group records correctly, based on when an event occurred, and not just artificially, based on when an event was ingested or processed (which is very often wrong).

2) State management. By modeling your problem as a streaming problem, you can keep state across boundaries. So two events that arrive in different time intervals but still belong to the same logical group can still be correlated with each other.

Why Flink?

Streaming according to Flink

This may sound compelling, but why go for Flink when there are so many alternatives out there?

Why would anyone choose Flink over Spark in specific, which is enjoying wide popularity and vendor support? After all, if it's latency you're worried about, as Tom Reilly, Cloudera's CEO put it, "we're able to offer sub-second responses and we don't hear any complaints from customers."

Spark is a really good community, but Spark as a platform has some fundamental problems when it comes to streaming, argues Tzoumas.

"It's not about sub-second responses, it's about how to approach a continuous application that needs to keep state. one approach is what we are doing with checkpoints, the approach that Spark has with RDD and micro-batching presents many problems.

Latency is one of them, managing this in production is another. Imagine that every five seconds you have to bring up and down 500 Spark workers. Every five seconds, forever!

If you have a simple, stateless application that is part of a processing pipeline, for example moving data from Kafka to HBase, then it is possible. But when you go to more complex, stateful applications that also need to connect to a database, this becomes impossible. You can't reconnect to your database every five seconds.

Spark was one of the first streaming engines out there, so it was used a lot for applications that are part of data ingestion pipelines. and it works fine for that. But as soon as you move to stateful applications over streaming data, you have a problem."

So, how does Flink avoid this problem? Flink provides the facilities to manage and persist state, and state can also be exported. Currently the format used for this is Flink-specific, but the roadmap includes opening up to other formats too.

The benefit of this approach is that it enables state rollbacks, which combined with version control and DevOps makes for a powerful mechanism: applications can roll back, effectively moving back in time and replaying their processing, using the same data and codebase that were operational at a certain point in time, or even alternating different versions of the codebase and data at will.

Flink periodically snapshots state to a persistent storage in-flight, utilizing a distributed algorithm. These state snapshots can be used for fault tolerance and rollback, and Flink supports different State Backends: in-memory, file-based, or db-based. There is also a checkpoint store which is pluggable, while RocksDB is used internally. If an application goes down, Flink resumes by reloading state from a shared file system.

Flink's new release 1.2, which will be officially announced on February 6, includes an array of new features on Queryable State, Asynchronous I/O, Security, Metrics, Table API & StreamSQL, Dynamic scaling, and Cluster management.

Flink 1.2 new features

"Our experience with DevOps in organizations working with cluster management shows that DevOps in production is a hard problem," says Tzoumas.

"I wish I could tell you that it's easy, but it's not. This is something that we really plan to focus our work on in order to make it as easy as possible. So we want to offer better and tighter integration with the all environments Flink can run in, with monitoring, migration, updates, maintenance, the whole lifecycle."

Flink Forward

data Artisans

So what to make from all that? Should you skip switching to Spark, or throw it away and switch to Flink instead? And, supposing you're sold on the idea, how would that work? According to Tzoumas:

"Most people we talk to are not trying to move away from Spark, they are starting to think in terms of streams and are trying to figure ways to process their streams.
In our experience, Spark to Flink migration is something a number of people have tried and were successful at. However at the moment we don't have a clear migration path from Spark to Flink in terms of best practices, as we have not had enough demand for it. But we do have a Storm compatibility layer that people can use.
Porting code and stitching together your data pipeline however should be the least of your worries in such scenarios. It's managing that pipeline in long running, high availability production systems that's the really hard part. It's all about DevOps, because it's not like a batch job. If a batch job fails, you can run it again. If a live application feeding into other live applications fails, it's much more mission critical."

But there's more to enterprise technology than technology itself. Is Flink battle-tested, and can you get the kind of support for it that Hadoop vendors for example can offer?

If companies like Alibaba or Netflix are any measurement at all, the answer would be "yes...but not exactly". Elaborating on dA's relationship with Alibaba, a massive retailer that brands itself as a "data company", Tzoumas says:

"Alibaba observed that taking into account real-time information in search would make a difference in their bottom line, and they used Flink to improve their search.

Alibaba started with their own fork of Flink (which they called Blink), and, since last summer, they have been contributing significant functionality back to Flink, gradually moving closer to the main Flink branch. Tens of great engineers at Alibaba are working with and contributing back to Flink!

Alibaba is a great relationship for us. We have similar relationships with other companies, such as Netflix, Uber, and King.com, that shared their insights in the last Flink Forward conference.

However, I would not classify these companies as the typical enterprise customer. They are simply operating at a much larger scale and have more specialized needs than most enterprises. More typical enterprises that we are engaged with and we support come from the banking and broader financial sector, e-commerce, internet, telecommunications, etc."

For those more typical enterprises, dA released its own platform in September 2016. At this point it is essentially a mirror of the open source (OS) Flink platform, but it comes pre-integrated with all the nuts and bolts that support a fast data architecture and with enterprise level support, Quality Assurance, and Service Level Agreements.

It's a stable, controlled version of the Flink platform which gives dA the flexibility to quickly provide patches and fixes for enterprise clients without having to go through the OS community process, even though eventually fixes generated through this process find their way to Flink. dA however is not committed to be a pure OS play: the roadmap for dA platform's evolution includes proprietary extensions too.

Tzoumas argues that "not everything is in the scope of Flink, and we don't want to start a gazillion OS projects, as the point is to not just have code but a community. In addition, from the business development perspective, looking at the margins of pure open source companies and combining it with the messages we are getting from the market, we think that this may not be the best way to move forward."

And moving forward dA is. One year and several versions ago, Forrester labelled dA a strong performer, and noted that dA "has great potential, but the company needs a healthy dose of investment capital to increase its marketing, sales, and professional management."

Started by a core team of five former academic researchers in Berlin in 2014, which is still dA's HQ, dA has been progressing in strides. dA now employs 20 people and has already received a Seed round of 1 million euros by b-to-v Partners and Series A of 5.5 million euros by Intel Capital with participation by Tengelmann Ventures and b-to-v Partners.

The US is next in line: dA has also incorporated in the US, and Flink is having its flagship Flink Forward event in San Francisco. According to Tzoumas Flink has a big and vibrant US community, but at the moment dA only has ex-Twitter's Jamie Grier in their payroll there. Grier has been really successful helping clients run Flink in production, but regardless of how good he might be, one person can only do so much.

dA has also on occasion gone to market with Hadoop vendors. Many organizations deploy Flink on YARN, so to make the best of these scenarios "they asked dA and Hadoop vendors to work together, and that's what we did. But it goes beyond this ad-hoc type of collaboration, as dA has partnerships in place with most Hadoop vendors."

Tzoumas has also co-authored a book with MapR's Ellen Friedman, and points to the fact that Flink can run both on-premise and in the cloud and is part of two (cloud-based) Hadoop distributions -- Amazon EMR and Google Dataproc -- as well as Lightbend's fast data platform.

dA seems particularly aligned in terms of philosophy and technical approach with Google's take on streaming.

"We've been involved with Apache Beam before it was called Apache Beam, and we're proud to have two people employed at dA as active Beam PMC members. Flink's DataStream API follows the Dataflow model, as does Apache Beam, and we are maintaining and supporting the Beam Flink runner, the most advanced runner beyond Google's proprietary Dataflow runner. Going forward, I do expect Beam and Flink to cross-pollinate," says Tzoumas.

Tzoumas sums up dA's vision and philosophy:

"Our core team has been together from the start, and our CEO-CTO duo with me and Stephan (Ewan) has been working great. We're not trying to monetize every part of Flink possible, or to throw money at problems: doubling your team size does not mean doubling your efficiency.

We're trying to grow the market organically, spread the unbounded data processing paradigm and grow ourselves through this. This isn't just some crazy architecture we drew up on a whiteboard -- it's something we've seen implemented in production, and we'll be working with the teams that built these applications to share more about the use cases.

So we are not going to do this alone, no way. Streaming data is a clear trend, and Flink is a key part of it. It's the best thing you can use right now for any kind of compute on streaming data, so it's really in the center of this evolution, but it's not the complete evolution. There are a lot of other very good players out there."

How to implement AI and machine learning:

Editorial standards