Apache Cassandra’s road to the cloud

Taking open source databases to the cloud usually involves proprietary extensions. Could the Apache Cassandra community rewrite the script?


DataStax has recently released its managed cloud service, Astra, and the Apache community has just released the beta for its next-generation 4.0 platform. With 4.0 pretty much feature-complete, has the time come for the Apache community to tackle extending the open source platform to the cloud?

The cloud has become a natural home for popular open source databases thanks to a confluence of forces. Enterprises seek the operational simplicity of the cloud, while cloud providers can promote the automated management and housekeeping of open source databases as a way to add unique value to what is otherwise a commodity, easily copied core technology. So, when you use Amazon's, Azure's, or Google's managed database services for MySQL or PostgreSQL, the databases might be open source, but the management tier is proprietary to the cloud provider.

There are already several competing cloud-managed Cassandra services, including Astra from DataStax and Amazon Keyspaces from AWS, both of which became generally available in the past few months. And don't forget Instaclustr, which has been running its own managed Apache Cassandra services in the cloud for at least the last five to six years.

So why should the Apache open source Cassandra project get involved if cloud services are already available?

There are a couple of reasons why we're asking this question. First, open source technologies like Kubernetes have become de facto standards, making the possibility of consensus thinkable. Second, DataStax is seeking to realign with the open source Apache community after years of estrangement. The company has developed its own Kubernetes operators for its Astra managed DataStax Enterprise service, but now wants to go back to the community to see if there might be a more standard way.

DataStax chief strategy officer Sam Ramji opened the door in a recent post on what it would take to make Apache Cassandra cloud-native. In a few words, he summarized it as four areas: gateway, operations, management, and deployment, and described how DataStax built its K8s operator. Last March, DataStax publicly released the source code for its K8s operator.

As in any standards or open source project, the chief determinant isn't necessarily technology, but people and culture. We spoke with several members of the community, and it appears they may be receptive to working alongside DataStax.

The mood is certainly a lot different from a few years back, when we weathered a Twitter storm after posting about DataStax's olive branch overtures to the community. It helps that in the interim, DataStax changed out its top management team to help get matters off to a fresh start. Instaclustr, which began containerizing its cloud implementation of Cassandra prior to the emergence of Kubernetes, is keeping its hand in the effort to see if the results could streamline its implementation. At this point, AWS, which offers Keyspaces but is not currently part of the community, is the major odd man out on the K8s operator part of the project.

What makes the goal of open sourcing cloud-native extensions to Cassandra feasible is the emergence of Kubernetes and related technologies. The fact that all of these technologies are open source, and that Kubernetes has become the de facto standard for container orchestration, has made it thinkable for herds of cats to converge, at least around a common API. And enterprises' embrace of the cloud has created demand for something to happen now.

A cloud-native special interest group has formed within the Apache Cassandra community and is still at the early stages of scoping out the task; this is not part of the official Apache project, at least not yet.

Of course, the Apache Cassandra community had to get its own house in order first. As Steven J. Vaughan-Nichols recounted in his exhaustive post, Apache Cassandra 4.0 is quite definitive, not only in its feature-completeness, but also in the thoroughness with which it has flushed out the bugs to make it production-ready. Unlike previous dot-zero versions, when Cassandra 4.0 goes GA, it will be production-ready. The 4.0 release hardens the platform with faster data streaming, not only to boost replication performance between clusters, but also to make failover more robust. But 4.0 stopped short of addressing anything to do with Kubernetes.

According to Ben Bromhead, CTO of Instaclustr and an active member of the community, the community is just starting to sink its teeth into the problem. Besides DataStax, several members such as Instaclustr and Orange have already written their own operators, but they realize that, going forward, there's little value in each member having to update and maintain them independently.

Rahul Singh, who heads Cassandra consultancy Anant and curates content for the community, noted that the effort is currently at the Special Interest Group (SIG) stage, meaning it's not yet formally part of the open source project. The first task will be specifying custom resource definitions for the Cassandra operator to ensure that everybody is using the same syntax.

Ultimately, the goal will be a common Kubernetes operator, regardless of whether the implementation is the pure Apache open source version or a particular vendor's. As this is still in the exploratory stage, there aren't any timelines for what will get delivered when. But the fact that an open source project is actually extending its reach to cloud-native implementation is definitely a change to the usual script.