AWS re:Invent 2019 Postmortem: Introducing Cassandra and bringing ML to Aurora

AWS unveiled a large number of database announcements last week. Among them: a new service that could shake up the Apache Cassandra market, and an integration bringing ML to Aurora.


It seems like barely a year goes by without AWS introducing yet another new database. Given the flurry of announcements last week, it was easy to miss. The highlights ranged widely: embedding AWS cloud services in 5G networks; announcing more custom silicon; bringing Kubernetes support to AWS Fargate; introducing an expanded IDE for Amazon SageMaker; making S3 access control more granular; new developments with Amazon Echo; adding machine learning to the call center; not to mention blowback from the recent Defense Department JEDI contract.

But beyond all the headlines, databases continue to be AWS's fastest growing business. Since it was introduced back in 2014, Aurora has been the fastest grower, and before that, it was Redshift. Behind all those machine learning, IoT, AR and VR applications, and gaming technologies lies data.

And so at re:Invent last week, AWS added database number 15 to the portfolio: the Amazon Managed Apache Cassandra Service (MCS). It arrives atop a flurry of announcements adding new integrations between some of its database platforms and AWS's AutoML services, S3 data lakes, and new query federation capabilities. Asha Barbaschow provided a quick summary of several of the headliners; now that the dust has settled, we'll give you a deeper dive on Cassandra and Aurora today. Tomorrow, we'll turn our focus to Redshift.

AWS enters the Cassandra space

The headline was the announcement of the MCS preview. Over the past few years, AWS had seen demand for addressing MongoDB and graph workloads, and now it was finally Apache Cassandra's turn. While the use cases for Cassandra are similar to those of the Amazon DynamoDB platform (both are distributed databases), the choice will likely be driven by developer preference. Until now, Cassandra has had a loyal developer community. But because the platform has lacked easy-to-use tooling, its community has not grown as fast as those of more accessible NoSQL platforms such as MongoDB.

Amazon MCS is currently in preview. By comparison, DataStax, which until now has been the primary commercial provider for Cassandra, recently introduced a managed Apache Cassandra service that is currently in beta. So, there's a race on to get the products out the door. Both are based on Apache Cassandra version 3.11, with the primary difference for now being that AWS's offering is serverless and will be integrated with its existing cloud management services, such as AWS Identity and Access Management (IAM) for access management, Key Management Service (KMS) for encryption at rest, and Amazon CloudWatch for monitoring.

The significance is not simply that AWS added a new database offering. It's that the emergence of managed cloud services is shining a new spotlight on Cassandra, and potentially, the Cassandra community. It's a community that, as we learned firsthand a few years back, has had its share of dysfunction. AWS hopes to reach out to the community, but it won't be alone. With DataStax's hiring of Google veterans Chet Kapoor as CEO and Sam Ramji as chief strategy officer, 2020 should prove an eventful year for Apache Cassandra.

Bringing ML to Aurora

AWS also made announcements extending its Aurora transaction and Redshift analytic databases. Aurora has added integration with some of AWS's AutoML services, including SageMaker and Comprehend, so that models can run directly on data stored in the database. That should save developers a few steps: previously, running models required moving data out of the database and back in. With the new integration, developers can write SQL queries that call a SageMaker or Comprehend model, and then visualize the results in AWS QuickSight.
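To make the in-database pattern concrete, here is a minimal sketch of a query builder that composes the kind of SQL a developer might run. The schema-qualified function name (`aws_comprehend.detect_sentiment`) is drawn from Aurora PostgreSQL's `aws_ml` extension, but treat the exact signature, and the table and column names, as assumptions for illustration.

```python
# Hypothetical sketch: composing a SQL query that scores each row's text
# in-database via a Comprehend-backed function, instead of exporting the
# data, calling the model, and re-importing results.
def sentiment_query(table: str, text_column: str, language: str = "en") -> str:
    """Build a SQL statement that calls an in-database ML function
    (function name assumed from Aurora PostgreSQL's aws_ml extension)."""
    return (
        f"SELECT {text_column}, "
        f"aws_comprehend.detect_sentiment({text_column}, '{language}') AS sentiment "
        f"FROM {table};"
    )

# Example: score every review in a hypothetical product_reviews table.
print(sentiment_query("product_reviews", "review_text"))
```

The point of the integration is visible in the shape of the query: the model call sits inline in the SELECT list, so the data never leaves the database engine.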

AWS is hardly the only database provider building in integration with machine learning, but it's still early days. Competing databases are starting down the path of using SQL to call ML models, but not all of these capabilities are available in the cloud -- yet. By comparison, Microsoft introduced in-database R and Python as user-defined functions in SQL Server 2016 and 2017, respectively, and is starting to make R services available in the cloud on the Azure SQL platform. Google BigQuery also allows developers to build and call models using SQL, while Teradata supports SQL access to its own pre-built machine learning libraries. This is just a taste of what's to come.

And one last thing

In its decade-plus in business, AWS has built an extremely wide infrastructure portfolio: five categories, 16 instance families, and 44 instance types. Short of having cookbooks, selecting just the right instance for a given workload can be a daunting task. We've argued in the past that this is a problem that demands a machine learning approach that listens to and profiles your workloads.

Last year at re:Invent, we met up with Accelerite, which was about to release a tool that analyzes cloud infrastructure utilization to rightsize your VMs. But we've been waiting for AWS to step up to the plate here, as it controls, and obviously has firsthand knowledge of, its own portfolio. A few months back, it took the first step, releasing a tool for identifying underutilized EC2 instances. It has now taken the next step with AWS Compute Optimizer, which can be launched from the AWS Management Console; it tracks resource consumption via CloudWatch and uses ML to provide a list of recommendations on which compute instances to use.