Kafka channels the big data firehose

Kafka has emerged as the open source pillar of choice for managing huge torrents of events. The challenge is refining the tooling and raising the game on security beyond basic authentication.
Written by Tony Baer (dbInsight), Contributor

Hadoop and big data platforms were originally known for scale, not speed. But the arrival of high-performance compute engines like Spark, along with streaming engines, has cleared the way for bringing batch and real-time processing together.

But what happens when your appetite for big data extends to dozens of sources, and quantities that reflect the traffic levels of a public web site? Or how about dealing with the world of IoT? The existing utilities of the Hadoop project, such as Flume, were set up for ingesting streams one-by-one to HDFS as the common target.

LinkedIn faced this issue back in 2009 when it wanted a more granular, real-time solution for tracking user behavior on its website. The problem was that existing open source messaging alternatives like RabbitMQ and ActiveMQ just didn't scale. Instead, LinkedIn turned to a new twist on an established technology pattern: publish/subscribe (PubSub) messaging.

PubSub messaging systems, which date back to the early nineties, were considered the glue that allowed enterprises to connect new front-end systems with immovable legacy backbone financial or transaction systems. They were typically considered operationally simpler compared to more elaborate enterprise application integration schemes. PubSub is the technology around which Tibco was born.

For LinkedIn, Kafka was the result. It takes PubSub messaging and scales it massively; it can hold and distribute messages arriving at up to millions of records per second. Streams of data are divided into topics pertaining to specific types of activities, entities, or categories. Unlike streaming systems, Kafka doesn't filter messages or records, and unlike legacy messaging systems like IBM MQ, it does not perform routing. But if you have a fire hose to deal with, Kafka is your baby.
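
To make the topic idea concrete, here is a toy in-memory sketch in Python. It is not Kafka's client API and says nothing about how Kafka is implemented; the names TinyBroker, publish, and read are hypothetical, chosen only for illustration. It assumes each topic can be modeled as a simple append-only list, and it shows the two properties described above: records are grouped by topic, and the broker hands them back without filtering or routing them.

```python
from collections import defaultdict


class TinyBroker:
    """Toy stand-in for a topic-based PubSub broker (hypothetical, not Kafka's API)."""

    def __init__(self):
        # Assumption for illustration: each topic is an append-only list of records.
        self._topics = defaultdict(list)

    def publish(self, topic, record):
        """Append a record to a topic and return its position within that topic."""
        log = self._topics[topic]
        log.append(record)
        return len(log) - 1

    def read(self, topic, start=0):
        """Return every record in a topic from position `start` onward, unfiltered."""
        return list(self._topics[topic][start:])


broker = TinyBroker()
broker.publish("page_views", {"user": "u1", "url": "/home"})
broker.publish("searches", {"user": "u1", "query": "kafka"})
broker.publish("page_views", {"user": "u2", "url": "/jobs"})

page_views = broker.read("page_views")
searches = broker.read("searches")
print(len(page_views), len(searches))  # prints "2 1"
```

Any filtering or routing would have to happen in whatever reads from the topic, which is the division of labor the article describes between Kafka and downstream streaming engines.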

The typical use case for Kafka is around live monitoring, such as tracking web site activity and user behavior. But the use case that breaks the field wide open for Kafka is the same one that has also brought streaming to the forefront for many organizations: anything having to do with IoT. And Kafka could also be useful for processing real-time records (or events) for scenarios ranging from supply chain optimization to public sector use cases such as real-time tax compliance.

Because Kafka is associated with data in motion, it is often confused with streaming engines. But Kafka acts as a traffic cop, serving as the ingest point from a stream or the transmission point to a stream.


The challenge, however, is that as an open source technology, Kafka is fairly bare-bones. The core open source technology lacks the type of visual development, configuration, and monitoring environments that would be necessary for wide enterprise adoption.

Confluent is the exception that proves the rule; it provides the polished visual front end and management hub meant for enterprise consumption. It includes an integration framework with certified connectors to several dozen databases, streaming engines, and storage pools, plus APIs for developing custom connectors; a management console that provides more granular visibility on operations compared to what general-purpose Hadoop panes of glass display; and its own streaming API, just in case you want to have a single hub handle all the integration and stream processing.

In actuality, Kafka's competition isn't other messaging buses, but platforms that perform more of the downstream work of integrating and routing data flows. Confluent's competition, in turn, consists of platforms that add the surrounding functions of integration, analytics, and application development to the core messaging piece.

So, while most Hadoop platforms, NoSQL databases, and cloud services support Kafka, many of them offer competing managed services. Amazon Kinesis is a service that manages ingestion and provides an environment for developing streaming data applications and forming SQL queries. Not to be outdone, Google Cloud Dataflow provides a managed environment for the services necessary to support streaming and batch applications run on data pipelines; the Apache Beam project takes the API from that service to allow you to mix and match different components, including messaging, under a common programming model.

On the Hadoop side, MapR Streams one-ups Confluent by moving message brokering directly into the Hadoop cluster (which MapR is gradually pivoting into a broader big data storage and application platform). Meanwhile, Hortonworks Dataflow does not directly compete with Kafka (it could get fed by Kafka), but provides the utilities for managing the integration and flow of data downstream.

Kafka has drawn a wide ecosystem of commercial support and, according to Confluent, roughly a third of the Fortune 500 are already using it. But for Kafka to gain mainstream appeal, it requires a broader tooling ecosystem; aside from Confluent's offering, management tooling is rudimentary. As Netflix's experience shows, there are still teething issues if you want to deploy Kafka at extreme scale in the cloud.

Furthermore, security, in the form of authentication support, has only recently come to Kafka. Just as Hadoop providers have upped their game with additional role-based access and lineage, Kafka requires higher-level capabilities that will make security easier to manage, especially as user populations expand to take advantage of the massive data flows that this little PubSub engine that could delivers.
