Less than six months after its most recent acquisition, Cloudera is filling out its Dataflow streaming integration platform with SQL-based event stream processing. The new addition, SQL Stream Builder, fills a gap in Cloudera's streaming platform by giving SQL developers an entry point for querying streaming data. Before this, Cloudera Dataflow was accessible only to Java, Scala, or Python programmers.
Cloudera's SQL streaming engine uses Apache Flink under the covers. SQL Stream Builder adds to Cloudera Dataflow, which includes edge processing and real-time data ingestion, along with support for other streaming engines such as Kafka Streams, Spark Streaming, and yes, even Apache Storm (a now-inactive open source project dating back to the Hortonworks days). SQL Stream Builder can integrate in both directions with Kafka, taking feeds from Kafka topics or creating views that can be published through Kafka.
Capabilities include syntax checking, error reporting, schema detection, query creation, sampling results, and generating outputs that can include materialized views that are accessible via REST API or PostgreSQL wire protocol. It can automatically detect schema from JSON data sources.
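To make the idea of schema detection concrete, here is a minimal Python sketch of inferring column types from a few sample JSON messages. It is not Cloudera's implementation; the function name `infer_schema`, the type names, and the widening rules are all assumptions chosen for illustration.

```python
import json

# Hypothetical mapping from Python types to SQL-like column types.
_SQL_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "VARCHAR"}


def infer_schema(json_lines):
    """Infer a flat {field: sql_type} schema from sample JSON messages."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            if value is None:
                # Nulls carry no type information; keep any type seen so far.
                schema.setdefault(field, None)
                continue
            # Nested objects and arrays fall back to VARCHAR in this sketch.
            sql_type = _SQL_TYPES.get(type(value), "VARCHAR")
            seen = schema.get(field)
            if seen is None:
                schema[field] = sql_type
            elif seen != sql_type:
                # Widen int/float mixes to DOUBLE; otherwise use VARCHAR.
                numeric = {"BIGINT", "DOUBLE"}
                both_numeric = {seen, sql_type} <= numeric
                schema[field] = "DOUBLE" if both_numeric else "VARCHAR"
    # Fields that were only ever null default to VARCHAR.
    return {f: (t or "VARCHAR") for f, t in schema.items()}
```

Sampling only a handful of messages keeps detection cheap, at the cost of missing fields or types that show up later in the stream.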
With SQL Stream Builder, Cloudera is the latest to offer a streaming analytics service based on Flink. We posed the question a few years back of whether the world needed yet another streaming engine. Flink emerged around the same time as Apache Spark, but Spark drew the lion's share of the limelight. The two are mirror images of each other: Spark was designed for microbatching and was then extended to support stream processing, while Flink was just the opposite. Spark support grew widespread, with the sweet spot being data transformation, but in the past few years, Flink has been quietly gaining traction, as it was one of the first open source engines built for streaming. The Flink squirrel is finally getting its 15 minutes of fame.
For instance, AWS uses Flink as the streaming engine underlying Amazon Kinesis Data Analytics, the SQL streaming service that is the closest comparison to what Cloudera is releasing. The other Flink-based service, from Ververica (formerly data Artisans, now owned by Alibaba), comes from the team that created Flink; its offering uses Flink SQL, which covers only a subset of SQL. Cloudera's SQL uses Apache Calcite, with event time-based functions integrated with Kafka.
Cloudera joins a fairly mature market landscape, with nearly two dozen offerings for integrating, parsing, filtering, and analyzing events or continuous streams. They divide between open source and proprietary engines, and between query engines requiring programmatic approaches and those offering SQL. For SQL, the stumbling block has been the need to extend the language with sliding window capabilities, which in most cases were added through proprietary add-on functions.
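For readers unfamiliar with sliding windows, the following Python sketch shows what such an extension has to compute: a count of events for every overlapping window of a fixed size that advances by a fixed slide. It illustrates the concept only and is not taken from any vendor's engine; the function name and the use of integer event timestamps are assumptions.

```python
from collections import Counter


def sliding_window_counts(timestamps, size, slide):
    """Count events per sliding window [start, start + size).

    Assumes integer event timestamps, with window starts aligned to
    multiples of `slide`.
    """
    counts = Counter()
    for ts in timestamps:
        # Latest window start at or before ts, aligned to the slide.
        start = (ts // slide) * slide
        # Walk back through every earlier window that still covers ts.
        while start > ts - size:
            counts[start] += 1
            start -= slide
    return dict(sorted(counts.items()))
```

With a size of 10 and a slide of 5, each event lands in two overlapping windows, which is exactly the behavior that standard SQL grouping cannot express without a windowing extension.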
Nonetheless, SQL query of streaming data dates back to the proprietary event stream processing engines of the early 2000s, several of which survive. Capital markets firms, transportation companies, and manufacturers led the early wave of event processing. Back then, it was termed "complex event processing," and offerings were based on proprietary event processing engines and proprietary query languages; SQL was an afterthought, added later. These early implementations were costly and required specialized skills, at a time when data and compute were themselves expensive.
Since then, of course, bandwidth and devices have exploded, and cloud computing has made real-time processing at scale accessible and far more affordable. Real-time events at scale from IoT devices, social networks, online commerce hubs, and public digital infrastructure have raised the urgency, with use cases ranging from commerce to capital markets, logistics, public safety, and, in this time of COVID, real-time epidemiology. Today, there are dozens of streaming technology platforms available, both open source and proprietary.
In this landscape, Cloudera differentiates Dataflow with support for multiple streaming platforms. It recommends Spark Streaming for scenarios with less demanding latencies, ranging from seconds to minutes; Kafka Streams for low-latency applications built on microservices; and Flink for low-latency, stateful applications with advanced time windowing capabilities. As Flink is suited for stateful use cases, it is a logical match for the applications that SQL developers typically work on. At least now, Cloudera Dataflow finally has an onramp for the SQL crowd.