Apache Kafka is a key component in data pipeline architectures when it comes to ingesting data. Confluent, the commercial entity behind Kafka, wants to leverage this position to become a platform for the enterprise and today is announcing a milestone on the road to ubiquity: SQL.
Streaming is hot. The demand for real-time data processing is rising, and streaming vendors are proliferating and competing. Apache Kafka is a key component in many data pipeline architectures, mostly due to its ability to ingest streaming data from a variety of sources in real time.
Confluent, the commercial entity behind Kafka, has the ambition to leverage this position to become a platform of choice for real-time application development in the enterprise. On the road to implementing this vision, Kafka has expanded its reach to include more than data ingestion -- most notably, processing.
In the process, Kafka's overlap with other platforms is growing, and Confluent seems set on adding features that will enable Kafka to stand out. Kafka Summit, the world's biggest Kafka event, taking place in San Francisco today, marks the advent of one such feature: Kafka SQL, or KSQL, a SQL implementation enabling Kafka users to process their streaming data using SQL rather than Java or Python APIs.
Neha Narkhede, Confluent co-founder and CTO, is the one who shared the news with the world in her keynote. Going beyond the announcement, Narkhede discussed KSQL with ZDNet.
SQL over streaming data is different from SQL over data stored in tables in traditional relational databases. Streaming data is unbounded, meaning that queries run and produce results continuously. If there is a need to produce bounded results, special constructs such as time windows are required.
This means that, by definition, SQL for streaming data will be different. The rationale, however, is that offering a SQL interface significantly lowers the bar to using streaming data: it is easier to learn a couple of additional concepts and constructs in a familiar language and toolset than to learn and use new APIs programmatically.
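To make the idea concrete, here is a minimal sketch of what such a continuous, windowed query looks like in KSQL -- the stream name and columns are hypothetical, not taken from the announcement:

```sql
-- Count page views per user over tumbling 30-second windows.
-- The query runs continuously; the window bounds each emitted result.
-- 'pageviews' and its columns are illustrative names.
SELECT userid, COUNT(*)
  FROM pageviews
  WINDOW TUMBLING (SIZE 30 SECONDS)
  GROUP BY userid;
```

The WINDOW clause is exactly the kind of "additional concept" mentioned above: everything else is plain, familiar SQL.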
Kafka is definitely not the first to offer SQL over streaming, so what is so special about this? Narkhede says that "KSQL is the only streaming SQL engine that is completely interactive and an event-at-a-time streaming SQL engine.
There are a couple others, but they are either micro-batch or require the user to switch between Java/Python code and SQL to write sophisticated stream processing operations. With KSQL, you can write any kind of stream processing operation using a completely interactive SQL interface."
KSQL versus APIs
Narkhede notes, however, that KSQL is built on top of Kafka's Streams API, and therefore inherits its elastic scalability, advanced state management, and fault tolerance:
"KSQL server embeds Kafka Streams and adds on top a distributed SQL engine (including some fancy stuff like automatic byte code generation for query performance) and a REST API for queries and control.
The KSQL server process executes queries. A set of KSQL processes run as a cluster. You can dynamically add more processing capacity by starting more instances of the KSQL server. These instances are fault-tolerant: if one fails, the others will take over its work.
KSQL is also tightly integrated with Apache Kafka and its message delivery guarantees. This includes Apache Kafka's strongest guarantee, exactly-once. Using KSQL configured for exactly-once delivery means you have worry-free stream processing without any data loss or duplication. These guarantees make it simple to build reliable real-time applications without writing code."
Confused? It may help to go back to Kafka's previously introduced interactive queries to make sense of this. What Confluent seems to be saying here is that KSQL is the next step in the evolution of interactive queries, promising to free developers from needing an API at all to query their streams.
KSQL -- what is it good for?
That sounds like a strong selling point, combined with Kafka's strengths as pointed out by Narkhede. But what do you need to know to be able to use KSQL, and what can you do with it exactly?
Narkhede says that the KSQL syntax takes ANSI SQL as a familiar starting point and enriches it with specialized keywords for manipulating streaming data. One such enrichment is the addition of STREAM as a first-class construct, alongside TABLE:
"Both of these constructs represent things that are continuously updated; a STREAM represents an unbounded sequence of structured facts, and a TABLE represents a continuously updated collection of evolvable facts.
As Kafka is built on the powerful notion of an immutable, durable log there is no support for what would traditionally be considered 'update' functionality. However, we do plan to add support for INSERTs in the future."
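As a sketch of what declaring these two constructs might look like -- the topic names, columns, and WITH properties here are assumptions for illustration, not taken from the announcement:

```sql
-- A STREAM over an existing Kafka topic of page-view events:
-- an unbounded, append-only sequence of facts.
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (kafka_topic = 'pageviews', value_format = 'JSON');

-- A TABLE over a topic of user profiles: for each key,
-- the latest value represents the current state of that fact.
CREATE TABLE users (userid VARCHAR, regioncode VARCHAR)
  WITH (kafka_topic = 'users', value_format = 'JSON');
```

The distinction mirrors the quote above: the stream accumulates facts forever, while the table evolves as newer values arrive for each key.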
Narkhede adds that the preview version supports a wide variety of stream processing operations, from continuous windowed aggregations to stream-table joins and much more. So, what can you do with interactive SQL over streaming data?
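A stream-table join, for instance, could look roughly like this -- again, the stream, table, and column names are illustrative:

```sql
-- Enrich each page-view event with the viewer's region by joining
-- the (unbounded) pageviews stream against the users table.
SELECT pageviews.pageid, users.regioncode
  FROM pageviews
  LEFT JOIN users ON pageviews.userid = users.userid;
```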
Will KSQL queries work as a trigger of sorts that, upon firing, enables further action? Will it be possible to store the key-value pairs that make up the underlying structure of a Kafka message in some external store and process them there?
All of the above are possible with KSQL, says Narkhede: "KSQL query results can be directed to new topics, or external data stores using Kafka Connect. This makes KSQL particularly suitable for Streaming ETL where you need to preprocess data before exporting it into a downstream system, be it Hadoop, Elastic, Cassandra or any other data system."
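In that streaming-ETL spirit, a query's results can be persisted as a new stream backed by a new Kafka topic, which a Kafka Connect sink could then export downstream. A hedged sketch, with hypothetical names:

```sql
-- Continuously filter the pageviews stream down to error pages and
-- write the results to a new stream (and its underlying Kafka topic),
-- ready for a Connect sink to export to Hadoop, Elastic, Cassandra, etc.
CREATE STREAM error_pageviews AS
  SELECT * FROM pageviews WHERE pageid = 'error';
```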
And what about schema, and non-relational data? Kafka has you covered according to Narkhede:
"Messages in Kafka topics often contain non-relational data. For this reason, KSQL provides mechanisms to read a variety of serialized formats, including JSON and AVRO encoded data, and define schema against the fields found within them.
In the initial preview version, the schema information is managed and used by KSQL using its internal catalog. We plan to integrate it with Confluent's Schema Registry as well. Beyond that, it may be more widely leveraged in the future."
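By way of illustration, defining a schema against serialized messages might look like the following -- the topic, columns, and exact WITH properties are assumptions, and the preview version's syntax may differ:

```sql
-- Declare a relational schema over Avro-encoded order messages;
-- KSQL deserializes each message and maps its fields to these columns.
CREATE STREAM orders (orderid BIGINT, amount DOUBLE)
  WITH (kafka_topic = 'orders', value_format = 'AVRO');
```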
The bigger picture
The press release makes it seem like developers are the target audience for KSQL, but that ain't necessarily so. In theory, analysts, Ops, or anyone who knows their SQL and has a use for data ingested through Kafka -- and that's a lot of people -- could benefit from KSQL.
One could imagine producing or consuming KSQL queries using anything from a text editor to the Tableaus of the world.
Narkhede says the target audience for KSQL is any developer who knows SQL and wants to process real-time streams in Kafka. Although she says that most Tableau-like tools cannot consume the never-ending results of continuous queries at this time, she also adds that KSQL can easily be integrated with Grafana for real-time monitoring and alerting:
"In our KSQL demo, we show developers an easy way to define metrics on their Kafka topics and export it using Kafka Connect into Grafana. Also, we're partnering with Arcadia Data on this and they are releasing native visual analytics on real-time, streaming data. The integration with KSQL gives all users advanced visualizations for streaming data use cases, specifically around alerts and time-based data exploration (and drill downs)."
When discussing KSQL's place in the bigger picture however, Narkhede makes it clear that this is about more than making life for developers easier:
"KSQL is a big step forward towards our vision of a streaming-first data architecture with Kafka. In some sense, it significantly broadens the reach of Kafka within an organization by making sophisticated stream processing available, beyond developers, to anyone who can write SQL.
Kafka and KSQL unlock a number of capabilities that were either not possible to do in real time or were cumbersome. The Kafka log is the core storage abstraction for streaming data, allowing the same data that went into your offline data warehouse to now be available for stream processing.
Everything else is a streaming materialized view over the log, be it various databases, search indexes, or other data serving systems in the company. All data enrichment and ETL needed to create these derived views can now be done in a streaming fashion using KSQL.
Monitoring, security, anomaly and threat detection, analytics, and response to failures can be done in real-time versus when it is too late. All this is available for just about anyone to use through a simple and familiar SQL interface to all your Kafka data: KSQL.
And if you want to use both in-motion and at-rest data, with the Kafka Connect API, it's easy to include data from other source systems into Kafka topics and then join these to data from other topics using KSQL."
This is consistent with Confluent's vision of getting Kafka to become the platform of choice for developing real-time applications. In this vision, the log becomes the center of gravity for data, querying happens on the fly on incoming streaming data, and external stores are used as cold data storage that can in some cases be entirely bypassed.
Although Kafka is not the only one to embrace this vision, its positioning as the entry point for data pipelines makes it an attractive choice to build on. Confluent sees this, and playing to this strength with KSQL may prove a decisive move in the battle for streaming platform domination.