Big data is a moving target, and it comes in waves: before the dust from each wave has settled, new waves in data processing paradigms rise. Streaming, aka real-time / unbounded data processing, is one of these new paradigms.
Many organizations have adopted it already, others have it in their radar, and almost every prediction for 2017 mentions it in one way or another. And if you're one to go with analyst recommendations, Forrester sees real-time as a key step on the way to a real-time, agile, self-service data platform.
In this introduction to streaming we take a look at use cases and benefits, gloss over different architectural choices and their implications, and present a brief map of the solutions landscape.
So, what's all the fuss about? Why would anyone go into the trouble of learning about yet another trend in big data, let alone investing the time and resources to adopt and implement it in their organization?
The short answer is, because real-time processing and analytics bears the promise of making organizations more efficient and opening up new opportunities.
The most familiar use case would be retail. In a recent discussion with Ted Orme, Attunity VP Technology EMEA, Orme mentioned the case of a major retailer able to cut down on their abandoned online transactions by 80 percent by applying real-time processing.
Instead of collecting operational data for analytics to be applied at a later point in time, the ability to apply analytics on the fly made it possible to identify transactions that were abandoned due to issues in credit card processing in real-time.
By doing so, the retailer was able to have agents contact customers whose transactions were abandoned and offer them help to complete their purchases. This resulted in a dramatic drop in the percentage of abandoned transactions: from 5 percent to 1 percent.
The bigger the volume and value of transactions, the bigger the increase in revenue, but the math adds up even for mid-sized retailers: for an average of 100 transactions per day worth $10 each, this would result in $1200 additional sales per month.
But this is just one example of the benefits real-time processing and analytics can bring to organizations. Another example is the combination of real-time processing with IoT. As Andrew Brust wrote a few weeks back, IoT sensor data has the potential to be used for applications that go beyond the typical predictive maintenance scenario.
Case in point, automating agriculture. Sensors collect and send environmental data in real-time, and that data needs to be processed in real-time as well in order to identify and respond promptly to needs identified in the field. This can result in more efficient, scientifically optimized agriculture with less manual labor involved.
IoT is a domain that aptly demonstrates the advantages of real-time processing, but it's not the only one. Ellen Friedman, MapR solutions consultant, author, and Apache Foundation committer, one of the leading experts in streaming real-time processing, has identified more use case domains: advertising, financial services, government, healthcare and life sciences, manufacturing, oil and gas, and telecoms.
But the applications are virtually endless -- whatever your domain, chances are the ability to gain operational insights in real-time could enable previously unattainable scenarios.
Lambda Kappa Alpha Beta Gaga
As Friedman puts it, "life doesn't happen in batches". A good deal of the processing in big data infrastructure does however.
According to Doug Cutting, the father of Hadoop, "it wasn't as though Hadoop was architected around batch because we felt batch was best. Rather, batch, MapReduce in particular, was a natural first step because it was relatively easy to implement and provided great value."
So now that (batch) big data is everywhere, has the time come to move on to the next stage and drop batch for good?
Not so fast, says Cutting: "I don't think there will be any giant shift toward streaming. Rather, streaming now joins the suite of processing options that folks have at their disposal." And this is where things start getting messy and technical, and opinions and architectures start diverging.
To the uninitiated, Lambda and Kappa, the 2 competing architectures for real-time processing, sound as greek as Alpha Beta Gaga. What are the differences, and why does it matter?
The Lambda Architecture was introduced in 2011 by Nathan Marz. It features a hybrid model that uses a batch layer for large-scale analytics over all historical data, a speed layer for low-latency processing of newly arrived data (often with approximate results) and a serving layer to provide a query/view capability that unifies the batch and speed layers.
The main drawback critics of the Lambda Architecture have pointed out is the fact that "without a tool [..] that can be used to implement batch and streaming jobs, you find yourself implementing logic twice: once using the tools for the batch layer and again using the tools for the speed layer. The serving layer typically requires custom tools as well, to integrate the two sources of data."
This translates to more complexity, less responsiveness and increased cost of maintenance, which is why the Lambda architecture is considered by many to be a transitional model.
As Dean Wampler, author of Fast Data Architectures for Streaming Applications argues, "if everything is considered a "stream" -- either finite (as in batch processing) or unbounded -- then the same infrastructure doesn't just unify the batch and speed layers, but batch processing becomes a subset of stream processing."
This has given rise to a modified architecture called Kappa, or simply fast data architecture, which also introduces a different line of thinking: if everything is a stream, then we have a new paradigm, and new paradigms come with their own definitions.
Tyler Akidau, Staff Software Engineer at Google and heavily involved in Google's streaming engine and model, has argued for the use of the terms bounded and unbounded data and processing to clarify and actually replace the term "streaming":
Unbounded data: A type of ever-growing, essentially infinite data set. These are often referred to as "streaming data." However, the terms streaming or batch are problematic when applied to data sets, because [..] they imply the use of a certain type of execution engine for processing those data sets. The key distinction between the two types of data sets in question is, in reality, their finiteness, and it's thus preferable to characterize them by terms that capture this distinction. As such, I will refer to infinite "streaming" data sets as unbounded data, and finite "batch" data sets as bounded data.
Unbounded data processing: An ongoing mode of data processing, applied to the aforementioned type of unbounded data. As much as I personally like the use of the term streaming to describe this type of data processing, its use in this context again implies the employment of a streaming execution engine, which is at best misleading; repeated runs of batch engines have been used to process unbounded data since batch systems were first conceived (and conversely, well-designed streaming systems are more than capable of handling "batch" workloads over bounded data).
Keeping it real-time
Great, more definitions. Does this matter, when what you really want is to be able to implement, uhm, unbounded data processing? It kind of does, according to unbounded processing pundits. But before we get to who they are and what their arguments are, let us clarify something: messaging does not equal streaming.
Note that in the Kappa architecture, there are two different components for data ingestion and data processing. Why is that and what is their difference?
"Streams of data arrive into the system over sockets from other servers within the environment or from outside, such as telemetry feeds from IoT devices in the field, social network feeds like the Twitter "firehose," etc. These streams are ingested into a distributed Kafka cluster for scalable, durable, temporary storage. Kafka is the backbone of the architecture", as per Wampler.
So data enters the processing pipeline via message queue infrastructure which serves as a sort of in-flight storage and distribution hub. Message queues (MQ) like Kafka are not new, and Kafka is not the only game in town.
Systems like IBM MQ Series and various implementations of the Java JMS specification like Apache Camel or RabbitMQ have been around for a long time, are well-used and understood and can be seen as the predecessors of unbounded data processing.
MQs are the gateways to an unbounded processing system. What happens next is complicated, but "for low-latency stream processing, the most robust mechanism is to ingest data from [Kafka] into the stream processing engine. There are many engines currently vying for attention," according to Wampler.
According to Kostas Tzoumas, PMC member of Apache Flink, one of these many engines, "it all starts with a streaming-first architecture where all data is in the form of immutable, never-ending streams. In this architecture, state is not centralized, but every application or service keeps its own local state. The role of Flink in the streaming-first architecture is that it provides tools for application developers to develop stateful applications.
Most interesting data sets in the real world are unbounded in nature, or at least this is how the data is born. Think of trades, transactions, clicks, machine measurements, logs; all these are unbounded data sets. if you look closely at most batch jobs and how data comes in and out, you will see a pipeline around them that actually handles unbounded data. This is a streaming problem in nature, and is best solved by streaming technology."
This will get us into the somewhat unclear waters of comparing these many engines vying for attention, and what each one has to offer in terms of architecture, supported use cases, vendor offerings and so on. This goes beyond the scope of a single post, but the interested reader may refer to reports written by Forrester and Bloor that try to shed light on the topic.
Although Forrester and Bloor unsurprisingly have somewhat different takes on the domain, they share one thing in common: they focus on the use of streaming for analytics. Clearly, analytics solutions for unbounded data rely on an underlying engine, but unbounded data processing is not only applicable to analytics.
Furthermore, not all engines have the same architecture, which means they do not necessarily support the same use cases, or do the same things in the same way.
Currently, the most popular among these engines are all open-source Apache projects: Apex, Beam, Flink, Storm, and Spark. Their proliferation speaks volumes on the increaased adoption and importance of unbounded processing, however if we were to pick the one engine out those which is currently getting the most attention, that would be Spark.
This comes as the result of a number of factors, and will be the subject of further analysis in the future. But we will continue with in-depth coverage of unbounded processing, the engines, architectures, models and vendors behind it, as well as real-world use cases in our next post, featuring exclusive news from one of the key players in this space.
How can you turn big data into business insight?