Confluent, Inc. is today announcing infinite data retention as a new feature of its Confluent Cloud managed Apache Kafka service. The company, founded by Kafka's creators, is introducing the capability as part of its "Project Metamorphosis," which aims to imbue Kafka with modern cloud properties. Infinite retention rolls out this month for Confluent Cloud Standard and Dedicated clusters on AWS, with other cloud providers to follow. The feature could fundamentally change the way Kafka is used.
In a briefing with ZDNet, Confluent CEO Jay Kreps explained that modern cloud properties, like elasticity, fully-managed operations, and separation of compute and storage, have largely eluded Kafka. Instead, companies that have adopted Kafka have had to manage a lot of moving parts and, in particular, have had to manage storage very explicitly and diligently.
Customers using Confluent Cloud dedicated clusters have had to pre-provision the storage they've needed and, typically, only a one-week window of data has been kept in Kafka topics. Now data will simply be able to accumulate, without limit. Customers will of course pay for the cloud storage needed to retain this data, but that is much more cost-efficient than using storage on the Kafka cluster nodes themselves. As a result, customers won't have to pre-provision any such node-level storage, and therefore charges for the clusters themselves won't change, with costs continuing to be based on the volume of data ingested and processed.
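For context, on self-managed Kafka "unbounded" retention is expressed through standard topic-level settings; the sketch below shows those real configuration keys (Confluent Cloud manages this automatically, so customers don't set these themselves):

```properties
# Topic-level settings for unbounded retention on self-managed Kafka.
# A value of -1 disables the corresponding limit.
retention.ms=-1
retention.bytes=-1
# Keep every record as-is rather than compacting by key.
cleanup.policy=delete
```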
Since their inception, the very premise of streaming data systems has been that they act as a transfer point for data, not a permanent home. As a result, they've typically served up a short window of recent and real-time data, while historical data has had to be retrieved from other systems. So-called "Lambda architectures," which seek to integrate real-time and historical/batch data platforms in order to provide applications with both types of data, might be better characterized as Rube Goldberg architectures.
But with infinite data retention, Confluent is promoting the idea that Kafka can be used as a database of record, rather than just a conduit. Even Confluent would likely not suggest that Kafka replace data warehouses and data lakes, or act as an analytics database platform. But just as relational OLTP (online transactional processing) platforms act as permanent database stores for transactional data, Confluent is saying Kafka can now do the same for event stream data. Essentially, Kafka can be data's permanent home, rather than just a hotel where data stays when it first arrives.
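The "permanent home" idea is essentially event sourcing: the full event history is the source of truth, and current state is derived by replaying it. A minimal Python sketch of that pattern, with a plain list standing in for a Kafka topic and all names illustrative:

```python
# Minimal event-sourcing sketch: the event log is the source of truth,
# and current state is a fold (replay) over the full history.
# The "log" here is a plain list standing in for a Kafka topic.

def replay_balances(event_log):
    """Rebuild per-account balances by replaying every event in order."""
    balances = {}
    for event in event_log:
        account = event["account"]
        balances[account] = balances.get(account, 0) + event["amount"]
    return balances

# With infinite retention, the log never has to be truncated, so state
# can always be rebuilt from scratch -- no separate "historical" store.
log = [
    {"account": "a", "amount": 100},
    {"account": "b", "amount": 50},
    {"account": "a", "amount": -30},
]

print(replay_balances(log))  # → {'a': 70, 'b': 50}
```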
The relational database analogy is more than just conceptually useful. About three years ago, Confluent introduced KSQL, a SQL query layer for Kafka. This feature allows developers to work with Kafka as if it were a relational database, and Kreps acknowledged KSQL as a major driver of customer demand for Kafka to act as a repository for historical as well as real-time data.
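For example, a continuously maintained aggregate over a topic can be declared in KSQL much like a SQL view; the stream and column names below are illustrative:

```sql
-- Illustrative KSQL: declare a stream over a Kafka topic, then a
-- continuously maintained table aggregating it (names are hypothetical).
CREATE STREAM pageviews (user_id VARCHAR, region VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

CREATE TABLE views_per_region AS
  SELECT region, COUNT(*) AS view_count
  FROM pageviews
  GROUP BY region;
```

With infinite retention, a query like this can be computed over a topic's entire history rather than only its most recent window.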
While KSQL isn't part of Apache Kafka per se, it is available from Confluent on a community-licensed basis for use with open-source Kafka implementations. On Confluent Cloud, KSQL technology is available as a fully managed service, alongside the core Confluent platform.
Streaming data is becoming increasingly important just as data protection regulations are creating imperatives around the management of historical data. In this context, infinite data retention for Kafka looks to be a game-changing capability, one that could make IoT, log, social, and clickstream data as easily queryable and manageable as transactional data has long been.