ScyllaDB achieves Cassandra feature parity, adds HTAP, cloud, and Kubernetes support

ScyllaDB, the open-source drop-in replacement for Apache Cassandra, is growing up. Version 3.0 closes the gap in terms of features, and has a few extras to add on top of superior performance over Cassandra.

ScyllaDB promises something simple, alluring, and hard to believe: Keep your codebase, replace Cassandra with ScyllaDB, and get up to a 10-times boost in performance. How can this be? In a nutshell: a different implementation language (C++ rather than Java), a lower-level programming approach via the Seastar framework (with explicit control over things such as memory and socket allocation), and auto-tuning capabilities.

That was the story of ScyllaDB 2.0. A few features were still missing, however, for ScyllaDB to be an exact drop-in replacement for Cassandra. Now, with version 3.0 announced at Scylla Summit, ScyllaDB not only closes the gap, but embarks on its own journey, starting with adding HTAP (Hybrid Transactional/Analytical Processing) capabilities and going cloud.

Closing the gap with Cassandra

Let's start with the features that were once missing from ScyllaDB and are now there. Materialized Views, Secondary Indexes, and File Formats may not sound very sexy, but they can make a lot of difference in application development and performance. Dor Laor, ScyllaDB co-founder and CEO, said they dedicated lots of hard work and a big part of their R&D to reach parity in terms of functionality:

"Those three features were long anticipated by many of our users and customers, so it was a no-brainer to invest in them. In general, both Cassandra and its ancestor, DynamoDB, are sound, feature-wise. It's their implementation that wasn't good enough.

For example, our secondary indexes are global and can therefore scale with any cluster size. This functionality not only encourages teams to switch from Cassandra to Scylla, it should influence other NoSQL users to switch to Scylla as well. We have a rich roadmap ahead of us beyond these features and we're excited to continue to evolve our database functionality."

ScyllaDB has achieved feature parity with Cassandra. Now what? Image: Getty Images/iStockphoto

Special emphasis is placed on Materialized Views, as the ScyllaDB people note this is a production-ready release of a long-awaited experimental feature designed to enable automated server-side table denormalization. They add that the Apache Cassandra community reverted this feature from production-ready to experimental mode in 2017.

"Materialized views turned out to be very complex, both for Cassandra and for Scylla," said Laor. He went on to add that they discovered many unaddressed design issues in the implementation, which caused them to deliver it long after their original plans. Laor noted that there are two main complexities in Materialized Views (MV) for Scylla and Seastar:

  • Complex write path. The write path was designed to be as simple as possible for maximum performance, but MV changes this. The view update mandates a read-before-write to the view. It adds complexity and also a performance penalty that Cassandra has a harder time coping with.
  • Eventual consistency. It's a big challenge to keep the base table and its views synchronized. Updates are fully asynchronous and parallel and it is both a performance challenge not to create a big lag between the view and the base and also a consistency challenge to keep them in-sync even in the face of failures.
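The read-before-write requirement in the first bullet can be illustrated with a toy in-memory model. This is a minimal sketch, not ScyllaDB's or Cassandra's implementation: the `ToyTable` class, its dictionaries, and the `email` view column are hypothetical names chosen for illustration, and the model ignores distribution and asynchrony entirely.

```python
class ToyTable:
    """Toy base table with one view keyed by a non-primary-key column.

    Hypothetical illustration only; real materialized views are
    distributed and updated asynchronously.
    """

    def __init__(self, view_column):
        self.view_column = view_column
        self.base = {}  # primary key -> row dict
        self.view = {}  # (view column value, primary key) -> row dict

    def write(self, key, row):
        # Read-before-write: the old row is needed to locate the stale
        # view entry, because the view is keyed by a column that may change.
        old = self.base.get(key)
        if old is not None:
            self.view.pop((old[self.view_column], key), None)
        self.base[key] = row
        self.view[(row[self.view_column], key)] = row

    def query_view(self, value):
        # Return all rows whose view column equals the given value.
        return [r for (v, _), r in self.view.items() if v == value]


table = ToyTable(view_column="email")
table.write(1, {"name": "Ann", "email": "ann@old.example"})
table.write(1, {"name": "Ann", "email": "ann@new.example"})
stale = table.query_view("ann@old.example")
fresh = table.query_view("ann@new.example")
```

Without the read of the old row, the second write would leave the entry for the old email address behind in the view. That extra read on every write is the cost the bullet describes.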

In addition, ScyllaDB claims its global secondary indexes can scale to any cluster size, unlike the local-indexing approach adopted by Apache Cassandra. Secondary indexes allow querying data through non-primary key columns. Finally, in terms of parity features, support for the Apache Cassandra 3.x compatible storage format (SSTable) is said to improve performance and reduce storage volume by as much as three times.
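The difference between local and global indexing can be sketched with a toy model. The code below is an illustration under stated assumptions, not either database's implementation: the four-node cluster, the `owner` placement rule, and the sample rows are all hypothetical.

```python
NUM_NODES = 4


def owner(value):
    # Hypothetical placement rule: sum of character codes modulo node count.
    return sum(map(ord, str(value))) % NUM_NODES


rows = [(1, "red"), (2, "blue"), (3, "red"), (4, "green")]  # (id, color)

# Local index: each node indexes only the rows it stores (placed by id).
local_index = [dict() for _ in range(NUM_NODES)]
# Global index: index entries are themselves placed by the indexed value.
global_index = [dict() for _ in range(NUM_NODES)]

for row_id, color in rows:
    local_index[owner(row_id)].setdefault(color, []).append(row_id)
    global_index[owner(color)].setdefault(color, []).append(row_id)


def query_local(color):
    # Any node may hold matches, so every node must be asked.
    hits = []
    for node in local_index:
        hits.extend(node.get(color, []))
    return sorted(hits), len(local_index)


def query_global(color):
    # Only the node owning this index value is asked. This counts the
    # index lookup alone; fetching full rows would add their owner nodes.
    node = global_index[owner(color)]
    return sorted(node.get(color, [])), 1


local_hits, local_nodes = query_local("red")
global_hits, global_nodes = query_global("red")
```

In this toy model both queries return the same ids, but the local lookup contacts all four nodes while the global index lookup contacts one, which is the scaling argument behind the claim.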

Going HTAP

But the really big news about ScyllaDB 3.0 is its HTAP capabilities. Laor, speaking at Scylla Summit 2018, said the company has developed a ground-breaking OLTP + OLAP service level agreement (SLA) guarantee that puts ScyllaDB on a path toward pure multi-tenancy and positions it favorably against Amazon DynamoDB and Microsoft's Cosmos DB, among others.

Scylla Open Source 3.0 will be available in November 2018, with concurrent OLTP and OLAP support following shortly after. Even so, it looks like a big deal. Indeed, Laor noted, this is one of the features ScyllaDB is most proud of, as it enables the database to support real-time and analytics workloads on the same data centers with the best utilization for both:

ScyllaDB is adding HTAP capabilities. This is not unique, but the way it's doing it is, according to ScyllaDB.

"Scylla leverages its sophisticated internal engines and schedulers, which already provide similar SLA guarantee capabilities, to the task. In the past, we used the schedulers to isolate foreground operations from background, maintenance operations. This is an improvement and additional implementation of our engine's abilities.

Just to be clear, Scylla is an operational, real-time database. Analytics themselves are performed by additional components, mainly Spark and Presto, over the dataset stored in Scylla. Scylla itself is not full HTAP, but the combination of Spark and Scylla is.

In terms of the technical underpinnings, Scylla manages your CPU and I/O scheduling, which allows you to create roles and assign user shares associated with your workloads. The resources utilized by each workload are tracked and matched against the SLA budget guarantee. It allows you to run different workloads in parallel on the same servers.

Real-time workloads receive the highest priority while other workloads, such as analytics, receive a best-effort approach and will only execute while there is spare capacity. It's a big improvement over what's presently possible, where users are forced to clone their complete dataset in order to run analytics on it so it will not affect the real time OLTP load."
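The prioritization Laor describes can be sketched in a few lines. This is a simplified model under stated assumptions, not ScyllaDB's scheduler: the `allocate` function, the abstract capacity units, and the workload names and share values are hypothetical, and unused grants are not redistributed.

```python
def allocate(capacity, realtime_demand, best_effort_demands, shares):
    """Serve the real-time workload first, then split the spare capacity
    among best-effort workloads in proportion to their shares.

    Simplification: capacity a workload does not use is not handed
    to the other best-effort workloads.
    """
    realtime = min(capacity, realtime_demand)
    spare = capacity - realtime
    total_shares = sum(shares[name] for name in best_effort_demands)
    grants = {}
    for name, demand in best_effort_demands.items():
        fair = spare * shares[name] / total_shares if total_shares else 0
        grants[name] = min(demand, fair)
    return realtime, grants


shares = {"spark": 200, "presto": 100}
demands = {"spark": 50, "presto": 50}

# Real-time load leaves 30 units spare, split 2:1 by shares.
realtime, grants = allocate(100, 70, demands, shares)

# Real-time load saturates the machine: analytics gets nothing.
busy_realtime, busy_grants = allocate(100, 120, demands, shares)
```

The second call shows the best-effort behavior: when the real-time workload consumes all capacity, the analytics workloads simply wait instead of slowing it down.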

Laor went on to add that no other database vendor is even close to this. This claim, however, is open to interpretation. For starters, DataStax Enterprise, the commercial, hardened version of Cassandra offered by DataStax, also utilizes Apache Spark for analytics.

Then there's also SnappyData and Splice Machine, to mention just some of the vendors building on Spark for HTAP, in addition to a number of others offering similar capabilities. Perhaps ScyllaDB's approach is unique in terms of combining SLAs with HTAP, or the way it prioritizes real-time workloads, but HTAP itself is hardly unique.

Going cloud

An interesting part of ScyllaDB's message was the comparison to Azure CosmosDB. There are grounds for this, as CosmosDB is also compatible with Cassandra's API, and Jonathan Ellis, DataStax CTO and co-founder, has also compared Cassandra to CosmosDB before.

When asked to do a similar comparison for ScyllaDB, Laor acknowledged:

"CosmosDB is impressive and it has made good progress, recently with the Cassandra API and active-active. It's hard to make a fair comparison since Cosmos is closed source and it's hard to know what's under the hood. However, the key differences are:

Scylla is open source, no vendor lock-in. With Scylla, hybrid cloud and multi-cloud are valid options. Scylla provides three-times better latency at a quarter of the cost on standard workloads. CosmosDB, like DynamoDB, will suffer from hot partitions with a reserved IO cap per partition.

Cosmos cannot differentiate between workloads as Scylla can. That means you pay even for best effort workloads, unlike Scylla which provides SLA guarantees. Cosmos active-active looks more like a datacenter property and not active-active per node like Scylla. This has an immediate effect on write performance and cost."

ScyllaDB is taking to the cloud, going for a Database as a Service offering. Image: ktsimage, Getty Images/iStockphoto

Now, CosmosDB is a cloud-only database. At the time ScyllaDB announced its version 2.0, the acquisition of Seastar.io had just been announced. A year later, a hosted version of Scylla in the cloud seems imminent, but is not yet available. What's taking so long, and what will ScyllaDB's hosted version be like? Laor pointed out that ScyllaDB recently launched the Scylla Cloud Early Access Program:

"Built on our Scylla Enterprise database, Scylla Cloud will be disruptive in the DBaaS market. Since it requires far fewer machines to achieve high throughput, its price performance will set a new bar for the industry. We haven't yet publicly announced Scylla Cloud because it's still in Early Access, though registration is available on our website. We are only a few weeks away from opening this up."

Now that Scylla is on par with Cassandra, Laor said, the next target is to become a leading database-as-a-service and serve as a better alternative for customers than the cloud vendors. Scylla Cloud will be a compelling offering, he went on to add, with three-times better latency at a quarter of the cost and no vendor lock-in.

Kubernetes and beyond

ScyllaDB is also working on adding support for Kubernetes, an ongoing trend among vendors offering data platforms. Given the ScyllaDB founders' background in hypervisors, they are "fully aware and deeply engaged," although there is currently a performance degradation when running ScyllaDB on Kubernetes.

Laor noted there will be a session on "Getting the Most out of Scylla on Kubernetes" at Scylla Summit. He also mentioned there is a dedicated #kubernetes channel on the company's Slack, and that the team is watching as users deploy and manage Scylla through Kubernetes in their environments.

"There are already a number of GitHub repos specifically for how to deploy Scylla using Kubernetes. The market is evolving, and this is really where being open source allows you to work directly with developers on the operational challenges they are facing. Nevertheless, the cloud, with its virtual machines and auto-scaling already offers better functionality than Kubernetes.

Scylla is a very efficient application. It can be run on fewer machines but dominates them, unlike other databases that cannot fully utilize the resources -- it would be a shame not to run other pods next to them. Thus, on the cloud we recommend to run directly on Linux while we will support full Kubernetes deployments on the cloud as well."

When discussing progress on the business front, Laor noted that, as a privately held company, it doesn't disclose financial information. He also added, however, that it is having a very good year across the board:

"Our open source community is growing quite quickly as word about Scylla continues to spread. 2018 is also the year our newly staffed Sales team began selling our Enterprise Edition in earnest, and during the year, we've added a number of Fortune 50 customers to our roster, along with lots of smaller ones. We've nearly doubled our headcount from a year ago and continue to expand."

As we have noted before, ScyllaDB is not short of ambition. It appears to be well under way in executing its strategy, making notable progress. It will be interesting to see how far this gets it.
