A closer look at Amazon Keyspaces

With AWS taking the Cassandra database serverless, will it finally reduce the hurdles to this popular but complex data platform? And how does it compare with Apache Cassandra and DynamoDB?

After years of waiting, Cassandra is joining open source counterparts MySQL, MariaDB, PostgreSQL, and MongoDB as a cloud Database-as-a-Service (DBaaS). As Asha Barbaschow reported, Amazon Keyspaces for Apache Cassandra is now hitting general release. It sets the stage for real differentiation in what was previously a gap in the market. Where there were no managed Cassandra database cloud services, now there will be at least two. With DataStax already having a service in beta based on a completely different design approach, the market will have a real choice.

As Barbaschow noted, Keyspaces has been designed to make it straightforward to migrate on-premises workloads to the cloud. But the real question is whether this will make Cassandra, a database that has never been known for its ease of use, accessible to a broader audience.

Diamond in the rough

It's ironic. Apache Cassandra was arguably the first NoSQL platform to introduce a truly distributed operational database into the wild. But it's also one of the last to get its own managed DBaaS offering, something for which both AWS and DataStax have seen plenty of demand. Both have had managed services in preview over the past few months, and now AWS has its service ready for release. AWS offers a native, optimized implementation of Cassandra that it terms a "serverless Apache Cassandra-compatible service."

Cassandra's strength has always been its support for high scale and performance as one of the first truly distributed databases to support multi-master operation. Its prime challenge was that the database could be very complex to implement.

For instance, tasks such as setup; backups; and garbage collection and compaction (key to maintaining data consistency in a distributed database) required sophisticated skills because of the low-level tooling. Part of the challenge is the inherent complexity of designing databases that are highly distributed. And then there is the challenge of how to model the data. In the relational world, you would design tables based on anticipated queries, and indexes as shortcuts for finding data to eliminate sorting overhead. In Cassandra, best practice is also to lay out the data based on expected read and write patterns, but there are some important differences. You can index data in Cassandra, but in actuality, denormalizing data across the cluster (or multiple clusters) is the better practice. As with any distributed, denormalized system, there is the question of getting the right workload balancing.
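To make the query-first, denormalized approach concrete, here is a minimal sketch in plain Python. It does not talk to a real cluster: the two dictionaries stand in for Cassandra tables, and the table names, the `record_order` helper, and the sample orders are all hypothetical. The point it illustrates is that the same record is written once per expected query, each time under the key that query will look up.

```python
from collections import defaultdict

# Two stand-in "tables", each keyed by the partition key its query needs.
# Plain dicts are used for illustration only; names and data are hypothetical.
orders_by_customer = defaultdict(list)  # answers: "all orders for a customer"
orders_by_product = defaultdict(list)   # answers: "all orders for a product"


def record_order(order_id, customer_id, product_id, quantity):
    """Write the same order into every table that serves a known query."""
    row = {
        "order_id": order_id,
        "customer_id": customer_id,
        "product_id": product_id,
        "quantity": quantity,
    }
    # Each table gets its own copy: the data is duplicated on purpose.
    orders_by_customer[customer_id].append(dict(row))
    orders_by_product[product_id].append(dict(row))


record_order("o1", "alice", "widget", 2)
record_order("o2", "bob", "widget", 1)
record_order("o3", "alice", "gadget", 5)

# Each read is a single key lookup; no join or secondary index is involved.
alice_orders = [r["order_id"] for r in orders_by_customer["alice"]]
widget_orders = [r["order_id"] for r in orders_by_product["widget"]]
print(alice_orders)
print(widget_orders)
```

The trade-off is visible even in this toy: writes cost more and storage is duplicated, in exchange for reads that touch exactly one partition.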

What about DataStax?

All this comes as DataStax is on the home stretch of readying its Astra managed cloud Cassandra service, which we expect will likely debut on Google Cloud. Through all this, it will continue to be business as usual for DataStax Enterprise on AWS, which will remain supported on EC2.

We expect DataStax to initially offer the pure Apache implementation (rather than DataStax Enterprise) designed for multiple public clouds once it emerges from beta. While AWS's implementation will differ, it is looking at opportunities where it could contribute features back to the open source community.

Amazon's approach

AWS is seeking to simplify matters by offering Keyspaces as a serverless offering. In doing so, it is stealing a page from DynamoDB, which is also serverless. As a managed service for an open source database, Keyspaces follows an approach that comes straight out of AWS's Amazon Aurora and DocumentDB playbooks: implement an open source database in a cloud-native architecture that separates storage from compute, with specific features that are optimized for AWS storage engines. The name Keyspaces refers to the top-level database container that controls replication of database objects in Apache Cassandra.

Going serverless makes life simpler by dispensing with the tasks of provisioning, patching, and managing servers. It also does away with the need to run compactions manually, because Keyspaces has its own storage optimization that dispenses with Apache Cassandra's tombstone mechanism for marking deleted data; that optimization, in turn, eliminates the need to provision more storage to keep housing the deleted data. Being serverless, Keyspaces will support autoscaling of compute that is priced either by the number of reads and writes, or by service tier (e.g., the ability to handle 50,000 reads or writes per second).
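For readers unfamiliar with tombstones, the following toy model sketches the general idea as it works in Apache Cassandra: a delete is recorded as a marker rather than an immediate removal, and the marker keeps occupying storage until compaction purges it after a grace period. The `ToyTable` class and its methods are hypothetical and greatly simplified; the 10-day figure matches Apache Cassandra's default `gc_grace_seconds`. This says nothing about how Keyspaces manages storage internally, which AWS has not detailed here.

```python
GC_GRACE_SECONDS = 10 * 24 * 3600  # Apache Cassandra's default grace period (10 days)
TOMBSTONE = object()               # marker meaning "this key was deleted"


class ToyTable:
    """Hypothetical, single-node illustration of tombstone-based deletes."""

    def __init__(self):
        self.cells = {}  # key -> (value or TOMBSTONE, write timestamp in seconds)

    def write(self, key, value, now):
        self.cells[key] = (value, now)

    def delete(self, key, now):
        # A delete is itself a write: it stores a marker instead of freeing space.
        self.cells[key] = (TOMBSTONE, now)

    def read(self, key):
        value, _ = self.cells.get(key, (TOMBSTONE, 0))
        return None if value is TOMBSTONE else value

    def compact(self, now, gc_grace=GC_GRACE_SECONDS):
        # Only tombstones at least as old as the grace period are dropped.
        self.cells = {
            k: (v, ts)
            for k, (v, ts) in self.cells.items()
            if not (v is TOMBSTONE and now - ts >= gc_grace)
        }

    def stored_entries(self):
        return len(self.cells)


table = ToyTable()
table.write("a", 1, now=0)
table.write("b", 2, now=0)
table.delete("a", now=100)
entries_after_delete = table.stored_entries()   # tombstone still takes a slot

table.compact(now=100 + 3600)                   # too early: tombstone survives
entries_before_grace = table.stored_entries()

table.compact(now=100 + GC_GRACE_SECONDS)       # grace period elapsed: purged
entries_after_grace = table.stored_entries()
print(entries_after_delete, entries_before_grace, entries_after_grace)
```

In the toy, the deleted key stops being readable right away but only stops consuming space after the grace period, which is the storage overhead the paragraph above refers to.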

Being part of the AWS portfolio, Keyspaces will be integrated with its core security, identity, and compliance services such as AWS Identity and Access Management (IAM) for access management; Key Management Service (KMS) for encryption at rest; and Amazon CloudWatch for monitoring.

As with DynamoDB, all data at rest will be encrypted. And, like DynamoDB, Aurora, and DocumentDB, Keyspaces will automatically support three replicas that can be distributed across different availability zones (AZs) within a region for the purposes of durability and performance. But there is a subtle difference, as Keyspaces also carries the multi-master capability of Apache Cassandra, a feature not available in Aurora or DocumentDB. While DynamoDB already has a cross-region multi-master capability called Global Tables, at launch Keyspaces will not have cross-region support. But we wouldn't be surprised if a Global Tables-like feature materializes for Keyspaces down the road.

So, let's take a look at how the new AWS service stacks up against Apache Cassandra, and the data platform against which Cassandra itself is often compared: DynamoDB.

Comparisons with Apache Cassandra

As Keyspaces is an AWS implementation of Cassandra, there are some differences with the Apache platform. For instance, Apache Cassandra can write transactions to any node, regardless of where it is located, whereas for now, Keyspaces can only write to nodes within the same region. Another difference is that at launch, Keyspaces will not have support for all CQL (Cassandra Query Language) functions; AWS states that it omitted CQL functions that would not be compatible with serverless operation along with others that it deemed "experimental."

There are other subtle differences in tablespace and key management, storage of system tables, load balancing, and range deletes, along with differences in best practices for CQL query tuning and partition sizing. For instance, in Apache Cassandra, best practice for partition sizing is to keep the number of values below 100,000 and the disk size under 100 Mbytes; Keyspaces, by contrast, will not have such partition limits. Nonetheless, AWS does limit rows to a maximum of 1 Mbyte.
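The sizing figures above can be turned into a simple pre-migration check. The sketch below is a hypothetical helper, not part of any AWS or Apache tool: the `PartitionStats` fields and the `sizing_warnings` function are invented for illustration, the thresholds are the ones quoted in the text, and treating 1 Mbyte as 1,000,000 bytes is an assumption.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; decimal megabytes are an assumption.
MAX_PARTITION_VALUES = 100_000         # Apache Cassandra guideline
MAX_PARTITION_BYTES = 100 * 1_000_000  # Apache Cassandra guideline (100 Mbytes)
MAX_KEYSPACES_ROW_BYTES = 1_000_000    # Keyspaces row limit (1 Mbyte)


@dataclass
class PartitionStats:
    """Hypothetical summary of one partition, gathered by the reader's own tooling."""
    value_count: int
    size_bytes: int
    largest_row_bytes: int


def sizing_warnings(stats, target):
    """Return human-readable warnings for the given target platform."""
    warnings = []
    if target == "cassandra":
        # Guidelines are "below" / "under", so reaching the figure also warns.
        if stats.value_count >= MAX_PARTITION_VALUES:
            warnings.append("partition holds 100,000 or more values")
        if stats.size_bytes >= MAX_PARTITION_BYTES:
            warnings.append("partition is 100 Mbytes or larger")
    elif target == "keyspaces":
        # No partition-level guideline here, but rows are capped.
        if stats.largest_row_bytes > MAX_KEYSPACES_ROW_BYTES:
            warnings.append("a row exceeds the 1 Mbyte limit")
    else:
        raise ValueError(f"unknown target: {target}")
    return warnings


stats = PartitionStats(value_count=250_000, size_bytes=40_000_000,
                       largest_row_bytes=2_000_000)
cassandra_warnings = sizing_warnings(stats, "cassandra")
keyspaces_warnings = sizing_warnings(stats, "keyspaces")
print(cassandra_warnings)
print(keyspaces_warnings)
```

The sample partition shows why the two platforms need separate checks: it breaks the Apache Cassandra value-count guideline but not the size guideline, while on Keyspaces the only problem is a single oversized row.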

Comparisons with DynamoDB

Under the covers, the two are very different databases. DynamoDB follows a simpler key-value schema, whereas Cassandra implements a wide column model that is more complex and treats partitions differently. As we noted in our commentary after AWS announced Keyspaces at re:Invent, the use cases for both databases (as distributed, operational platforms) are similar, but the main difference would likely be that of developer preference.

Originally, DynamoDB was the recommended destination in AWS for distributed NoSQL databases, as it was positioned as a platform that could handle key-value and document data. In fact, Cassandra and DynamoDB have a shared lineage in that the designers of Apache Cassandra applied a number of principles from Amazon's original Dynamo research paper; Amazon's Dynamo and SimpleDB databases were the ancestors of DynamoDB. Since then, AWS has considerably diversified its NoSQL database portfolio with DocumentDB, Neptune, Timestream, ElastiCache, and others to target different use cases and data types.

But Cassandra continued to distinguish itself as a multi-master distributed database, meaning that it could accept writes across instances scattered across different data centers. While AWS says Cassandra was not the role model, a couple of years back it found DynamoDB customers demanding multi-region replication, which was how Global Tables originated.

In developing Keyspaces, AWS took some lessons from DynamoDB; besides serverless operation, it adapted automated partition management for balancing read and write loads into the new service. Some features, such as plug-ins for authenticating with short-term credentials, have already been open sourced by AWS on GitHub. AWS may also contribute a comparable server-side component to the Apache Cassandra project to enable customers running the database on EC2 to manage access to their clusters similarly.

A bigger stage for Cassandra?

With Keyspaces, Cassandra becomes the latest open source database for which AWS is offering a managed service. In spite of the barriers to entry, Apache Cassandra has become one of the most popular databases out there, ranked eleventh by DB-Engines. A managed cloud service should broaden that audience.

But if it was so popular, what took Cassandra so long to get into the cloud? Look no further than the top five databases on DB-Engines; aside from Oracle and SQL Server, open source databases MySQL, PostgreSQL, and MongoDB round out the top five. First things first.

Beyond that, the answer to why it took so long is also the answer to why a managed cloud service is so badly needed: the complexity of the platform and the lack of decent tooling (we will probably get complaints about that one, but the available tools are not terribly intuitive). The good news is that the introduction of a managed cloud service will tackle the infrastructure half of the problem. But the database designer still needs to define the data model, something that a managed service cannot automate on its own. There are some good white papers available, and at AWS, a NoSQL Workbench tool for DynamoDB that could conceivably be adapted to Cassandra. Ultimately, we'd like to see some visual tooling that provides a guided approach to developing the schema. That's the missing link. We're hoping that AWS or DataStax, or preferably both, step up to the plate there.