Maybe it was a minor miracle that the weight of a couple of ZDNet bloggers didn't sink the pier on which Kafka Summit floated yesterday, but there's little question that Kafka technology is very much afloat. Confluent's most recent annual Kafka survey, published last June, found that over 90 percent of respondents deemed Kafka mission-critical to their data infrastructure, and that queries on Stack Overflow grew over 50 percent during the year.
A couple of years back, we looked at how Kafka emerged as the big data firehose. Fast forward to the present, and even AWS has gotten into the act, introducing its own managed Kafka service to go up against its own Kinesis. As with its recent introduction of DocumentDB and managed Kubernetes services, AWS is not sentimental about protecting its own proprietary offerings when enough customers demand open source. Likewise, Google Cloud, which offers the Dataflow service for building distributed data pipelines, also has a partnership with Confluent for hosting Confluent's managed Kafka service.
Big on Data colleague Andrew Brust covered the latest Confluent 5.2 platform release yesterday, which adds a number of developer goodies, such as giving C++, Python, Go, and .NET the same first-class citizen status long enjoyed by Java.
So the question is, given that Kafka has become the de facto standard engine for distributed big data messaging, where does Confluent go from here?
Taking note of Kafka's popularity, Confluent joined other open source providers in erecting its own licensing walls last year to keep the cloud guys from making money on its IP. Yes, there is the usual free developer edition, where you can mount the full Confluent stack for a single broker in your sandbox. Beyond that, there is the Confluent Community license, which allows the typical open source privileges – except that you cannot turn around and offer your own SaaS cloud service. That license covers features like KSQL, connectors, schema registry, and REST proxy. And then there is the proprietary Confluent Enterprise license for features like the management console.
Kafka's success has propelled Confluent into unicorn territory, as Big on Data colleague George Anadiotis reported last January. With a fresh $125 million financing round, the company's valuation has gotten to a crazy $2.5 billion. That prompts comparisons with Databricks, which itself received another $250 million infusion the following month to bring its valuation to $2.75 billion.
Both companies seem to be following similar trajectories. They've both created popular open source technologies that sit in the middle of the stack. But they've thrived for different reasons. Of the two, Kafka is clearly the less glamorous – it provides the underlying plumbing for enabling the types of distributed data pipelines that are used for building massively scaled real-time streaming applications. When Kafka works, you don't see it, but you see the dashboards that work atop it, or you see the models that are consuming all that data. Conversely, Spark gets more of the spotlight, as it includes the compute engines and the vast library of frameworks and algorithms that generate the analytics. While Spark is not the front end, it's a lot closer to it than Kafka.
As to business strategies, Databricks has followed more of the classic open core model, with the elements outside the runtime being outside the bounds of the Apache project, while Confluent, as noted above, has taken a more complex approach.
Reflecting that more of the spotlight shines on Spark, it has drawn far more competition. Spark's IOPS-intensive nature means that it is not always the best tool for the job -- especially for compute-intensive deep learning models where Spark is having to adapt. There are frameworks like H2O that work with, but also independently of Spark. Likewise, the machine learning services offered by cloud providers typically bypass Spark. And while the Apache Spark project has worked to improve the performance of R and Python programming, many users from the Anaconda and CRAN communities may use other execution engines for scaling their models.
By contrast, Kafka, sitting outside the limelight, has drawn far more modest competition. Sure, the Hadoop folks have either moved message brokering directly into the cluster (MapR Streams) or provide utilities that compete with Confluent (hello Hortonworks Dataflow), but their attention these days is focused more on making their platforms cloud-native, rather than trying to compete with Kafka. Instead, the real competition is with the cloud providers who are making their own accommodation.
Or maybe coopetition is the more suitable term. Confluent runs its cloud on AWS and Google Cloud (we're waiting for the time when they go live on Azure), but with Google, the relationship is more formal. We'd like to see GCP OEM the Confluent technology to make it a fully supported service alongside Dataflow, in the same way that Azure Databricks is now a formal part of the Azure portfolio. While we don't expect any Azure Databricks-like announcements to come from Google Cloud and Confluent next week at NEXT, we'll be curious to see what they have up their sleeves.