Confluent announces Infinite Storage for Apache Kafka

The company founded by Apache Kafka's creators is introducing infinite data retention on its Confluent Cloud platform, and with it is pushing the event streaming technology as a "database of record." That's a big change.

Confluent, Inc., the company founded by Kafka's creators, is today announcing infinite data retention as a new feature of its Confluent Cloud managed Apache Kafka service. The feature arrives as part of "Project Metamorphosis," the company's initiative to imbue Kafka with modern cloud properties. Infinite retention rolls out this month for Confluent Cloud Standard and Dedicated clusters on AWS, with other cloud providers to follow. The feature could significantly change the way Kafka is used.


Cloudy Kafka

In a briefing with ZDNet, Confluent CEO Jay Kreps explained that modern cloud properties, like elasticity, fully-managed operations, and separation of compute and storage, have largely eluded Kafka. Instead, companies that have adopted Kafka have had to manage a lot of moving parts and, in particular, have had to manage storage very explicitly and diligently.

Customers using Confluent Cloud dedicated clusters have had to pre-provision the storage they've needed and, typically, only a one-week window of data has been kept in Kafka topics. Now data will simply be able to accumulate, without limit. Customers will of course pay for the cloud storage needed to retain this data, but that is much more cost-efficient than using storage on the Kafka cluster nodes themselves. As a result, customers won't have to pre-provision any such node-level storage, and therefore charges for the clusters themselves won't change, with costs continuing to be based on the volume of data ingested and processed.
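In Apache Kafka terms, retention is governed per topic by settings such as `retention.ms` and `retention.bytes`, where a value of `-1` means unlimited. As a rough sketch of what "infinite retention" corresponds to on a self-managed cluster (the broker address and topic name below are placeholders), the stock `kafka-configs` tool can set both to unlimited; on Confluent Cloud, the managed service handles the equivalent for you:

```shell
# Set a topic's retention to unlimited time and unlimited size.
# (bootstrap server and topic name are placeholders)
kafka-configs --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name my-events \
  --add-config retention.ms=-1,retention.bytes=-1
```

The practical difference in Confluent's offering is that retained data lands in inexpensive cloud object storage rather than on the brokers' local disks, which is what makes unlimited retention economical.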

Take off your coat and stay awhile

Since its inception, the very premise of streaming data systems has been that they act as a transfer point for data, and not a permanent home. As a result, they've typically served up a short window of recent and real-time data, while historical data has had to be retrieved from other systems. So-called "Lambda architectures" that have sought to integrate real-time and historical/batch data platforms, in order to provide applications with both types of data, might be better-characterized as Rube Goldberg architectures.

But with infinite data retention, Confluent is promoting the idea that Kafka can be used as a database of record, rather than just a conduit. Even Confluent would likely not suggest that Kafka replace data warehouses and data lakes, or act as an analytics database platform. But just as relational OLTP (online transactional processing) platforms act as permanent database stores for transactional data, Confluent is saying Kafka can now do the same for event stream data. Essentially, Kafka can be data's permanent home, rather than just a hotel where data stays when it first arrives.

It's a (K)SQL world, we're just living in it

The relational database analogy is more than just conceptually useful. About three years ago, Confluent introduced KSQL, a SQL query layer for Kafka that lets developers work with Kafka as if it were a relational database. Kreps acknowledged KSQL as a major driver of customer demand for Kafka to act as a repository for historical as well as real-time data.
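To make the analogy concrete, a KSQL session might declare a stream over an existing Kafka topic and then query it with familiar SQL syntax. This is an illustrative sketch only; the topic name and schema below are invented:

```sql
-- Declare a stream over an existing Kafka topic
-- (topic name and schema are hypothetical).
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

-- Continuously count views per page over one-hour windows;
-- EMIT CHANGES makes this a push query that updates as events arrive.
SELECT page, COUNT(*) AS views
  FROM pageviews
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY page
  EMIT CHANGES;
```

With infinite retention, the same query style can in principle reach back over a topic's full history, not just a recent window.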

Also read: Kafka gets SQL with KSQL

While KSQL isn't part of Apache Kafka per se, it is available from Confluent on a community-licensed basis for use with open-source Kafka implementations. On Confluent Cloud, KSQL technology is available as a fully managed service, alongside the core Confluent platform.

Streaming data is becoming increasingly important, just as data protection regulations are creating imperatives around the management of historical data. In this context, infinite data retention for Kafka looks to be a game-changing capability that could make IoT, log, social, and clickstream data as easily queryable and manageable as transactional data has long been.