Apache Kafka, the open-source distributed messaging system, has steadily carved a foothold as the de facto real-time standard for brokering messages in scale-out environments. And if you think you have seen this opener before, it's because you have.
Besides being the opener of fellow ZDNet contributor Tony Baer's July piece on the Kafka usage survey, it echoes something you have probably read elsewhere, or felt yourself. Yes, Kafka is on most whiteboards, but mostly the whiteboards of early adopters: that was the gist of Baer's analysis.
Confluent CEO Jay Kreps indicated his belief that Kafka has actually gone mainstream in the last year. As evidence to back this claim, he cited use cases in four of the five biggest banks in the US, as well as the Bank of Canada: "These are 200-year-old organizations, and they don't just jump at the first technology out of Silicon Valley. We are going mainstream in a big way," Kreps asserted, while also mentioning big retail use cases.
While we have no reason to question these use cases, it's hard to assess whether they translate to adoption in the majority of the market as well. Traditionally, big finance and retail are at the forefront of real-time use case adoption.
Still, it may take a while for this to spill over, so it depends on what one considers "mainstream." Looking at Kafka Summit, however, we see a mix of Confluent staff and household names, which is the norm for events of this magnitude.
But what is driving this adoption? Something pretty low level, which is a pretty big deal, according to Kreps: the ability to integrate disparate systems via messaging, and to do this at scale and in real time. It's not that this is a novel idea. Messaging has been around for a while, and it has been the key premise of Enterprise Service Bus (ESB) solutions for years.
Conceptually, Kafka is not all that different. The difference, Kreps said, is that older systems were not able to handle the scale that Kafka can: "We can scale to trillions of messages. New style, cloud data systems are just better at this, such techniques did not exist before. We benefited as we came around a bit later."
Going cloud and real-time
The cloud is something Kreps emphasized, and the discussion of the latest developments in the field centered on it. The recent Cloudera-Hortonworks merger, for example, touches upon this as well, according to Kreps.
"It was a smart move. These were two companies competing on the same product, which makes the competition more fierce, ironically. You'd think it's people with different views that compete more fiercely, but it's actually people with similar views. That really showed also in the business model," Kreps said.
Kreps believes that this competition slowed down progress in core Hadoop, as the need for differentiation drew more attention toward edge features. A case in point, he noted, is that HDFS, Hadoop's file system, which historically has been a key component of its value proposition, is no longer the most economical way to store loads of data -- cloud storage is now.
This could also be interpreted as a sign of a move away from the batch processing that Hadoop started with and toward real-time processing. Although Hadoop has gradually grown into a full ecosystem, including streaming engines, the majority of its use cases are still batch-oriented, Kreps believes. How this will evolve, time will tell.
Despite Kreps pointing to the cloud as a center of gravity, and Hadoop actually moving toward it in the last couple of years, Confluent is not going to pursue a cloud-only policy. As opposed to data science workloads, which can be hosted either on premises or in the cloud, the kind of data infrastructure that Kafka provides must work in both, argued Kreps.
Since many organizations still have huge investments in software and infrastructure built over years in their data centers, any move to the cloud will be gradual. Confluent's hosted version of Kafka plus proprietary extensions will continue to work seamlessly with on-premise Kafka or Confluent open source, said Kreps. He also emphasized Kafka support for Kubernetes, noting that any stateful data system has to put in some effort to make this work.
Streaming coopetition and real-time machine learning
In terms of differentiation from other streaming platforms, Kreps pointed out that these are mostly geared toward analytics, while Kafka is infrastructure on which operational systems can be, and are, built. When asked whether Kafka could also be moving in the analytics direction, Kreps did not give any such indication, and questioned the applicability of real-time machine learning (ML):
"What is the use of a real-time machine learning platform? When I was in school, ironically the focus of my advisors was real-time ML -- ironically, because ML was not very popular back then, let alone real-time ML.
"We were struggling to name a mainstream production system using real-time ML. And the idea of having an ML algorithm retrain itself in real time is not necessarily positive. Most of the time, the effort is to have enough checks and balances in place to make sure ML really works even when working with batch data.
"And if you look at ML algorithms built by people who build databases and infrastructure, they are never as good, which is normal. There is a separate ecosystem for data science, and the best stuff is separate from the big infrastructure projects."
More often than not, Kafka seems to be mentioned in the same breath, or whiteboard, with a number of other systems, including streaming ones. Although some might say this means it will be hard for Kafka to come into its own, its position in those architectures also means it's equally hard to take it out of the equation.
Although no big announcement is reserved for this Kafka Summit, Kafka and Confluent have had a few of those in the last year -- KSQL and version 5.0 being the most prominent ones -- and seem to be well on the way to the mainstream.