The ability to ingest and process large volumes of data in real time is something interesting to more and more organizations. This is an area seeing rapid growth, as the use cases can translate to direct business benefits.
We have been following this space for a while now, and the release of Apache Samza 1.0 is a good opportunity to revisit it and see how this changes things, if at all.
Born and raised in LinkedIn
Apache Samza was developed at LinkedIn in 2013. Samza became a top-level Apache project in 2014, and now, it is used by over 3,000 applications in production at LinkedIn. The use cases include detecting anomalies, combating fraud, monitoring performance, notifications, real-time analytics, and many more.
If you are somewhat familiar with this space, this should ring a bell. Apache Kafka, another key framework for real time data processing, was also developed at LinkedIn. Kafka became the standard transport mechanism for tracking data at LinkedIn, and with loads of data generated daily into Kafka, applications needed to obtain insights from it.
These applications dealt with a common set of problems as they consumed messages from Kafka, such as checkpointing, management of local state, handling failures, scaling-out processing, etc. Apache Samza was built to tackle these problems in stream processing. The thing is, however, that this may have made sense then. Does it still make sense today?
This may seem like an odd question to ask. After all, the inflow of data has not stopped -- on the contrary, it has increased. What is different now is that there are many other frameworks around built to deal with those exact same needs: Apex, Flink, Spark, and Storm, to list just the Apache open-source projects. We also have a few proprietary solutions.
Samarth Shetty, Apache Samza team lead at LinkedIn, said that when they were first dealing with stream processing, there were very few existing stream processing frameworks that could handle the scale or the technical problems they have at LinkedIn. For that reason, they determined that it would be best to build their own framework:
"For example, we had to build capabilities such as Incremental Checkpoints and Host Affinity in Samza. These were not available at that time in frameworks such as Apache Flink. Kafka Streams is suited for processing events in Kafka but at LinkedIn, we need an ability to process events from/to different systems such as Azure EventHubs, AWS Kinesis etc.
Samza provides this ability to easily connect to different systems. Stream processing is a very important use case at LinkedIn, so we were committed to building the best framework possible. Over time, we believe we have built on one of the most advanced stateful stream processing frameworks that is best suited to handle LinkedIn's need for scale."
This somehow resembles the situation we have with machine learning frameworks: There's just so many of them out there. Having many options is not a bad thing, of course, but the line between "many" and "too many" is a thin one. Soumith Chintala, Facebook's team lead for PyTorch, replied along similar lines: We wanted something that worked for us, so we built it.
A Beaming API
As opposed to Kafka and Confluent, however, Samza never grew into a product with a vendor of its own. Shetty said there is a group of engineers dedicated to stream processing at LinkedIn, and Samza is an integral part of that work. He went on to add, however, that they don't see Samza as a product:
"We see it as an open source project. LinkedIn has always been involved with open source in this way -- when we build tools that we think could be valuable outside of the company, we want to give them back to the community when possible.
While there isn't currently a vendor tied to Samza, multiple companies, including Slack, TripAdvisor, Redfin, and Optimizely, are using it in production. Something that we feel attracts other organizations to using Samza is the fact that it is battle-tested at scale, since we use it ourselves at LinkedIn."
As we noted when we last covered streaming, we just saw DataTorrent, the vendor behind Apex, go out of business recently. Perhaps there just isn't enough room for more open-core streaming vendors at this point. So, if you want to use Samza, you will have to figure it out, and run it in production, on your own. That has not changed with version 1.0. But other things have changed, in a way that may affect the broader landscape too.
Most notably, Samza now has a new API, and it's Apache Beam compatible. Apache Beam is an open-source project that provides a unified API, allowing pipelines to be ported across execution engines, including Samza, Spark, or Flink. It also allows for data processing in other languages, including Python, which are heavily used in the data science community.
The Samza team realized at some point that while stability and performance have been Samza's core strengths, its programming API was fairly low-level. Samza offered a simple callback-based API, allowing operations to be specified at the level of an individual message.
Developers had to implement complex operations such as windows and joins by themselves on top of this API. In addition, multiple Samza jobs needed to be wired together using Kafka topics, which made building applications time consuming and error-prone.
Samza 1.0 comes with a high-level API that enables developers to express complex data pipelines easily by combining multiple operators. But the Samza team went one step further, making their API compatible with Apache Beam.
What this means is that now you can write streaming applications on top of Beam using Java, Scala, or Python, and you can port them across all frameworks that support it, and you can run them on Google Cloud Dataflow -- at least in theory.
Both Flink and Spark have proprietary extensions beyond the Beam API, and Spark creators are not very interested in supporting Beam. Beam support on Spark is via community contributions.
But if you stick to core Beam, portability should be possible. In any case, another entry in the list of projects that support Beam may give it some momentum. It may also help Google Cloud Dataflow. This is where Beam was started, and Beam can serve as a way to onboard streaming applications to Google Cloud.
SQL, DevOps, and what this all means
That's not all for Samza 1.0 though, as it comes with some more important new features: SQL and DevOps imporovements. At some point, the Samza team realized that even though their API got an upgrade, APIs are not everyone's cup of tea. Non-engineers would much rather have access via SQL, and Samza is on the way to providing this.
We say on the way, because even though technically SQL works, it still has to be used via API calls. This sort of defeats the purpose of lowering the entry barrier, but Shetty noted they are in the process of creating an interactive shell for SQL, too. SQL support is getting to be table stakes in streaming these days --Kafka, Flink, and Spark have it.
Another area of improvement for Samza 1.0 is cluster manager independence. Prior to Samza 1.0, Samza required YARN for resource management and distributed execution of applications. YARN works, but users wanted the flexibility to run stream processing in any environment. Samza 1.0 addresses this by offering a standalone mode of running applications.
This mode allows Samza to be embedded as a library within an application and run on any resource manager. There's a couple of catches, however: The standalone mode does not yet support stateful stream processing like windowing and joins, and support for Kubernetes is not quite there yet. The Samza team is working to address these. Samza 1.0 also brings support for joining streams with tables, and improves testability of Samza applications.
So, what's the verdict? Samza has made huge progress with release 1.0. The team behind it seems to be aware of its shortcomings, and committed to working on them. They are also aware of its strong points, mentioning for example how Samza has managed to process 1.2M messages/sec on a single machine.
The question is, however: Should you care, or use it? If you are confident that you can set it up, program it, deploy it, and maintain it on your own, then it is definitely worth considering. It may be open source purists' first choice in fact -- what may deter some will appeal to others.
Samza 1.0 is important whether you use it or not. It's important because it gives Apache Beam a leg up, at a time when it seemed to be in a stalemate. Not every organization is LinkedIn, not every organization has the need, or the luxury, of building and maintaining its own framework. But developing on top of Beam seems like the best bet for portability at this point.
And even though LinkedIn does not seem to be giving Samza up, using Beam means that they could even do that, and be able to port their applications seamlessly on the API level. That's no small feat.