Streaming is becoming more and more important, gaining traction and adoption by the day. In addition to the typical use cases for real-time, data-driven applications, namely financial services and IoT, other areas such as retail and programmatic advertising are now carrying the torch.
In the last few months, we have seen a flurry of new developments in this field from key players such as Apache Kafka and Apache Spark. This time, it's Apache Flink's turn to make a move, announcing not one, but two new versions: 1.4, which is out this first week of December, and 1.5, which will be out in early 2018.
Why the slightly unusual release schedule, what's on the agenda, and where does it all fit in the big picture? ZDNet discussed this with Apache Flink PMC member and dataArtisans CTO Stephan Ewen.
Last time we checked, Flink had just released version 1.2. Although that was only a few months ago, lots of water has flowed under the bridge, and it seems Flink, as well as dataArtisans (dA), are going places. dA is the commercial entity that employs many of Flink's committers and provides Flink-related services and products.
To begin with, Flink has some pretty big use cases to show for itself, such as Alibaba and Uber, which, besides the usual bragging rights, hold some technical interest. There is also a surprise potential partnership in the making with none other than Dell EMC, which, again, is interesting for anyone mapping out this domain.
Alibaba has been a key Flink user for a while; the difference is that now Flink and Alibaba are closer than ever. Like many organizations of its size, Alibaba uses lots of proprietary technology. For example, Alibaba has its own message bus implementation and its own deployment environment.
This means that many of the standard integrations that are part of Flink do not work for Alibaba, so a fork was created. This is not an ideal situation, as forks mean effort may be duplicated by having people work on the same problems, and it becomes harder to maintain and integrate separate codebases. Flink people report that, after working closely with Alibaba, near-parity has been reached and the community can benefit from Alibaba's contributions.
This is the kind of iconic, big-scale customer any platform is proud to have on its list, and Flink is no exception. The Flink team goes to some length to detail, for example, how Flink powered Singles' Day, what has been called the world's biggest shopping spree.
Speaking of iconic customers and open source, the name Uber also comes to mind. Uber decided to base its recently released open-source streaming analytics platform, called AthenaX, on Flink. When discussing the exact nature of the platform, Ewen's interpretation was that AthenaX builds on top of Flink and adds integrations and components, including monitoring and cross-data-center failover, which are tailored to Uber's infrastructure.
Uber is as Uber does, and no one but Uber could interpret Uber's real intentions with AthenaX. Ewen said he does not consider AthenaX as competition for Flink, but it's hard to think of a really good reason why Uber would open source something that is only good for Uber. Flink is working on getting Uber to contribute back to core Flink, so time will tell whether this space may have an unlikely entry.
And then there's also Dell EMC. Not as a client, mind you, but as a wannabe contestant. Not many people have heard the name Pravega before, so seeing it as an integration in the list of new features for Flink may raise some eyebrows. Pravega is an open source project that states it "provides a new storage abstraction -- a stream -- for continuous and unbounded data."
What's Dell EMC got to do with it? Although there's no official announcement yet, a look at Pravega's team leaves no doubt as to where this is coming from. Ewen had some praise for Pravega, saying that its architecture is super interesting and that it gets many things right in terms of how to build a message queue and log today, taking into account lessons learned from Kafka.
The Pravega team reached out, saying they are building a product and would like to use Flink as the compute engine. The Flink team wants to be storage-agnostic and is willing to work with anyone, and integration was not a big deal, partly because Flink and Pravega concepts are aligned. Although it's too early to tell and there was no confirmation, don't be surprised to see a Dell EMC offering in this space too.
As for Flink's own new version, what better place to start than SQL. SQL on streaming data is not the same as relational SQL, as the semantics need to be redefined to include things such as time windows for results. Flink has supported SQL since 1.2, and Ewen said it works by viewing streams as tables and having SQL queries on them that work incrementally: They can filter, add, or transform rows and columns.
As of 1.4, Flink SQL also supports joins between streams. Interestingly, joins between streams and database tables are not supported out of the box. For Flink, everything is a stream, so Ewen said user-defined functions can be used to wrap tables and have them participate in joins. This is a common use case, for example when looking up values, so it may be added in an upcoming version.
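The wrapping approach Ewen describes can be pictured with a small, framework-free sketch. This is plain Python, not Flink's actual API, and all names here are made up for illustration: a lookup table is wrapped in a function that enriches each stream record as it arrives.

```python
# Conceptual sketch of joining a stream against a table by wrapping
# the table in a user-defined function. Not Flink's actual API.

currency_names = {"EUR": "Euro", "USD": "US Dollar"}  # the lookup "table"

def lookup(code):
    # The user-defined function wrapping the table.
    return currency_names.get(code, "unknown")

def enrich(stream):
    # Join each incoming stream record against the table as it arrives.
    for amount, code in stream:
        yield amount, code, lookup(code)

payments = [(10.0, "EUR"), (25.0, "USD"), (3.0, "GBP")]
print(list(enrich(payments)))
# [(10.0, 'EUR', 'Euro'), (25.0, 'USD', 'US Dollar'), (3.0, 'GBP', 'unknown')]
```

The point of the pattern is that the table never has to be turned into a stream itself; it simply answers point lookups as records flow past.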
Anything from a Kafka queue to a Parquet file can be wrapped as a stream and participate in joins, and that is a powerful feature, especially considering all the underlying details of time stamping are handled transparently. Ewen emphasized that this provides unified stream and batch processing, and the fact that it works incrementally means low latency.
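To make "works incrementally" concrete, here is a minimal sketch, again in plain Python rather than Flink SQL: a count-per-key query that updates its result as each record arrives, instead of rescanning the stream from the start.

```python
# Conceptual sketch of an incrementally evaluated streaming query,
# in the spirit of: SELECT key, COUNT(*) FROM stream GROUP BY key.
# Plain Python, not Flink SQL.
from collections import defaultdict

def incremental_count(stream):
    counts = defaultdict(int)
    for key in stream:
        counts[key] += 1          # each record updates the running result
        yield key, counts[key]    # emit the updated row right away

events = ["click", "view", "click", "click"]
print(list(incremental_count(events)))
# [('click', 1), ('view', 1), ('click', 2), ('click', 3)]
```

Because each record only touches the running state for its own key, results are available immediately, which is where the low latency comes from.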
Flink is not the only streaming platform supporting SQL, however. Kafka has added SQL to its arsenal recently, and Spark has also added a new approach called Structured Streaming. As you would expect from competing solutions, there is a benchmark, and there is dispute over this benchmark. So, here goes.
Ewen said the benchmark published by Spark, which unsurprisingly shows Spark to outperform both Kafka and Flink, is "completely broken, comparing apples to oranges. It's an old Yahoo benchmark, which ends up running mostly locally and sends very little data over the network. It reflects the difference in focus in the two projects."
He added, "Flink focuses on running extremely large scale workloads robustly with good performance and checkpoints. This benchmark uses 100 keys, which is kind of a joke -- no company has 100 metrics. Try running it with 100 or 500 million keys and the results will be very different".
What else is ripe for some healthy antagonism? Well, standards, of course. Although, in this case, Apache Beam is not exactly a standard. It is seen as an effort to unify streaming across platforms. But it's not exactly catching on. Apache Beam originates from Google's Dataflow model, but at this point it's only fully supported by Google and Flink.
The Kafka crew said they are not seeing traction and are not interested unless it adds support for tables; the Spark crew said support is there and anyone who wants to run Beam on Spark may well do so, but they are not going to commit any resources to it. It could be clashing commercial interests, different views, or both, but in any case, the point is: Beam is not exactly seeing universal adoption.
Ewen said they looked at Dataflow before embarking on building Flink, liked and reused many of its ideas, so it's natural that Flink is Beam-compatible. One of the criticisms of Beam is that it's complex, which Ewen acknowledged, but he also said other models can only take you so far, and eventually this complexity is needed to support more advanced scenarios. Beam is also working on a SQL specification for streaming, which, again unsurprisingly, is close to Flink's view.
Last but not least, there is a bunch of not-so-sexy-yet-useful features in Flink 1.4 and 1.5. But before we set out to describe what they are, let's answer the obvious question: Why two new versions? Ewen said they decided to do this because some of the changes will be disruptive and some not, so they kept the disruptive ones for 1.5 and packaged the rest in 1.4.
There is a common theme here, which is that Flink is going Hadoop-less. In 1.4, the entire dependency mechanism has been reworked, enabling, among other things, dropping dependencies such as HDFS and YARN when running on AWS. As deployment options are changing, largely due to containers, with Docker and Kubernetes gaining ground on YARN and Mesos, Flink is trying to keep up with the times.
That is the reason why 1.5 will introduce some disruptive changes, mostly having to do with the way Flink is deployed on Kubernetes. When discussing the difficulty of keeping up with every deployment option, Ewen's take was that Kubernetes seems to be winning the battle. Recent data from Sumo Logic on AWS deployments confirms that Kubernetes is seeing increased adoption.
Ewen acknowledged 1.5 will break some things for DevOps, but said it will be for the best in the mid-to-long term, as it will give them more natural and efficient ways of doing things they have been doing the hard way until now.
Ewen, however, said there will be no incompatibility in terms of APIs or checkpoints. Checkpoints are a key concept in Flink, as they allow applications to be migrated between machines on failover or during upgrades. In essence, they allow an application to retain not just the state from its store, but also incoming data from streams that may not have been processed yet.
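A checkpoint can be thought of as a pair of operator state and stream position: on failover, the application restores the state and re-reads the stream from the saved position. Here is a toy sketch of that idea in plain Python, with made-up names, not Flink's actual checkpointing mechanism.

```python
# Toy model of checkpoint-based recovery: a checkpoint captures both
# the operator state (a running total) and the position in the stream.
# Hypothetical sketch, not Flink's actual mechanism.

def process(stream, checkpoint=None):
    total, offset = checkpoint if checkpoint else (0, 0)
    snapshots = []
    for pos in range(offset, len(stream)):
        total += stream[pos]
        snapshots.append((total, pos + 1))  # snapshot after each record
    return total, snapshots

stream = [1, 2, 3, 4]
total, snapshots = process(stream)
print(total)  # 10

# Simulate a crash after the second record: resume from its checkpoint.
resumed_total, _ = process(stream, checkpoint=snapshots[1])
print(resumed_total)  # 10: records 3 and 4 are replayed, none lost or doubled
```

Because both state and position are saved together, the resumed run neither drops nor double-counts records, which is what makes migration and upgrades safe.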
Deployment and DevOps are an important theme for Flink. Ewen said that, based on their experience with clients, most of the time is spent not on actually developing application logic, but on making sure everything runs smoothly.
That's why dA has bundled what it calls an opinionated pack of solutions for running Flink in production, together with industrial-strength patches, SLAs, and services such as logging and metrics, and offers it as the enterprise version of Flink: the dA platform. dA platform version 2 is in beta at the moment and will be commercially available in early 2018.
Something dA tried to address, which, according to Ewen, turned out to be much cooler than they first thought, was the issue of updating applications in production. Ewen said that, unlike batch applications, streaming applications usually run continuously, meaning there needs to be a sophisticated approach to enable them to be upgraded without interrupting their operation or losing incoming data. A combination of checkpoints and dependency management has enabled them to achieve this.
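The role checkpoints play in such an upgrade can be sketched in a few lines of plain Python. This is a hypothetical toy, not dA's actual mechanism: version 1 of an application is stopped at a checkpoint, and version 2 resumes from that checkpoint with its state and stream position intact.

```python
# Toy sketch of upgrading a streaming application via a checkpoint.
# Hypothetical names and logic, not Flink's or dA's actual mechanism.

def v1(record, count):
    return count + 1  # original logic: count every record

def v2(record, count):
    # Upgraded logic: stop counting heartbeat records.
    return count if record == "heartbeat" else count + 1

stream = ["click", "heartbeat", "click", "heartbeat", "click"]

# The v1 application processes the first two records, then is stopped
# and a checkpoint (operator state plus stream position) is taken.
count = 0
for pos in range(2):
    count = v1(stream[pos], count)
checkpoint = (count, 2)

# The v2 application resumes from the checkpoint: state is preserved,
# and no record is skipped or processed twice across the upgrade.
count, offset = checkpoint
for pos in range(offset, len(stream)):
    count = v2(stream[pos], count)
print(count)  # 4
```

The new code picks up exactly where the old code left off, which is the "no interruption, no data loss" property described above.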
It may seem strange that Flink is releasing two new versions at such a short interval. It may also seem like there aren't that many groundbreaking features in either. But Flink is making steady progress and getting established in some high-profile places. So is the streaming domain as a whole, and competition and adoption do not show any signs of subsiding.
NOTE: The article has been edited to reflect the change in the Flink 1.4 release date, some clarifications received regarding Stephan Ewen's interpretation of AthenaX, as well as the reference to Kubernetes use on AWS.