A rock and a hard place: Between ScyllaDB and Cassandra
How many NoSQL databases does the world really need, and how easily would you switch your existing solution for a new one? Asking these questions before setting out to build a NoSQL database is a good thing. The people behind ScyllaDB did, and now Cassandra may be between a rock and a hard place.
Cassandra is a poster child of the NoSQL world. Originally an open source project that sprang out of Facebook, it has been adopted by the Apache Software Foundation and is backed by an enterprise, DataStax, that also offers DataStax Enterprise based on Cassandra. Cassandra is among the top 10 database solutions according to DB-Engines.
That is precisely why it now has a potentially dangerous rival in ScyllaDB. ScyllaDB is a new kid on the NoSQL block aiming to offer a solution that is open source and API-compatible with Cassandra, but performs much better. The goal is to be a drop-in replacement for Cassandra, and when we're talking about database #8 in the world, that's kind of a big deal.
Dor Laor and Avi Kivity did not set out with this grandiose plan back in 2013. It was not for lack of ambition, but this just was not their thing. They both have backgrounds in hypervisors and were part of the team that built KVM and got acquired by Red Hat. Leaving Red Hat, their initial plan was to write a unikernel that would displace Linux from cloud servers. So no lack of ambition there.
They founded a startup called Cloudius, found investors, assembled a team and started working hard. At some point, however, they realized that their potential would not be reached for a number of reasons, and decided to pivot. And pivot they did, to add another NoSQL database to the never-ending list, one that would be able to do what Cassandra does and then some.
But why go for a NoSQL database, and why target Cassandra?
Part of Cloudius's mission was to speed up server loads, with an emphasis on databases. Laor, ScyllaDB CEO, says that they had managed to boost Redis performance by 70 percent without actually doing anything Redis-specific. You may wonder how that was possible, and there is an answer, but for now let's stick to the fact that this prompted them to take that direction.
It was a combination of market and technical reasons that made Cloudius target Cassandra. Laor says Hadoop was on their list as well, but since that had already been done they decided to go for rewriting Cassandra: "The world does not need another database format. Cassandra's format is good, and it is successful. Cassandra is the best high availability platform out there."
They say imitation is the sincerest form of flattery, and it's obvious from this that the ScyllaDB team found Cassandra worth imitating. But it's more complicated than that: "Cassandra is everywhere in critical workloads. But when we targeted it for optimization, we ran against limitations tied to its JVM nature. In the end, Cassandra ends up competing with itself."
Cloudius pivoted and rebranded, but kept the same team and investors. Thus ScyllaDB was born. You may think it's cheeky to target "the best high availability platform out there" and aim to do better, but Laor says they are hoping to see history repeating. And the entirety of that quote, "imitation is the sincerest form of flattery that mediocrity can pay to greatness," may not necessarily apply here.
"When we entered the market with KVM, all the players were established -- VMware, Hyper-V, Xen. We showed up last, but based on Avi's revolutionary design KVM now dominates. We think our differentiation this time around is even bigger," says Laor.
So what is this differentiation? ScyllaDB promises something simple, alluring, and hard to believe: keep your codebase, replace Cassandra with ScyllaDB, and get up to a tenfold boost in performance. There are benchmarks and references to back those claims, but how can this possibly work? It comes down to a number of things.
First, a different implementation language. ScyllaDB has been rewritten from scratch in C++, as opposed to Cassandra's Java-based codebase. The JVM adds an intermediate layer between source code and hardware, trading some performance for portability and ease of use. JVMs have come a long way, but the proper use of a language closer to the low-level fundamentals may result in better performance.
But that's only part of ScyllaDB's secret sauce. An equally big part has to do with those underlying fundamentals, such as memory or socket allocation. The kind of nitty-gritty details that are hard to grasp, program, and maintain, but can result in dramatic improvements. The kind of thing that you get to know intimately if you program, say, a hypervisor.
All those lessons learned through years of low-level programming have been distilled in SeaStar. SeaStar is an open source framework for high performance applications that ScyllaDB is built on, although there is nothing database-specific about it. SeaStar is event-driven and enables writing efficient non-blocking, asynchronous code.
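For readers unfamiliar with the event-driven, non-blocking style, a minimal sketch may help. It uses Python's standard asyncio library purely to illustrate the general idea. It is not SeaStar, which is a C++ framework, and it is not ScyllaDB code. The names handle_request, serve_all, and run_demo, as well as the simulated I/O delay, are assumptions made up for this illustration.

```python
import asyncio
import time
from typing import List, Tuple


async def handle_request(request_id: int, io_delay: float) -> str:
    # Simulated I/O wait (a hypothetical stand-in for a disk or network call).
    # 'await' hands control back to the event loop instead of blocking the thread.
    await asyncio.sleep(io_delay)
    return f"request {request_id} done"


async def serve_all(n_requests: int, io_delay: float) -> List[str]:
    # Start all requests concurrently on a single thread; the event loop
    # interleaves them while each one waits on its simulated I/O.
    tasks = [handle_request(i, io_delay) for i in range(n_requests)]
    return await asyncio.gather(*tasks)


def run_demo(n_requests: int = 100, io_delay: float = 0.05) -> Tuple[List[str], float]:
    # Run the whole batch and measure wall-clock time.
    start = time.perf_counter()
    results = asyncio.run(serve_all(n_requests, io_delay))
    elapsed = time.perf_counter() - start
    return results, elapsed
```

Calling run_demo() should take roughly as long as one simulated wait rather than the sum of all of them, because no request blocks the thread while it waits. That is the property event-driven frameworks aim for, although a production framework deals with far more than this toy does.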
The tradeoff? Complexity. Laor admits it's hard to program on top of SeaStar, but says the result is worth the effort. He mentions for example Pedis, a rewrite of Redis based on SeaStar done by Alibaba, which turbo-charges Redis. Besides, ScyllaDB promises, the average Cassandra user does not need to worry about that.
ScyllaDB aims to ease the complex task of configuring and tuning Cassandra deployments by offering auto-tuning capabilities. ScyllaDB has added improvements in both node management and network protocols with the goal of having clusters running optimally without requiring administrator intervention.
Laor compared this feature to Oracle's self-tuning database. There are, however, similar solutions for other platforms too, such as Spark. For Spark, some approaches are based on using machine learning on datasets gathered from many operational clusters, others on rules.
ScyllaDB has adopted the rule-based approach, as Laor does not believe datasets may be representative of all possible configurations. "We use developer intelligence, not artificial intelligence," he says. Arguably, datasets from operational Cassandra clusters would be hard to come by for ScyllaDB anyway. Which brings us to an interesting point.
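To make the distinction concrete, a rule-based tuner can be sketched as a handful of hand-written rules that map observed node properties to settings, with no training data involved. The sketch below is purely illustrative: the NodeStats fields, the setting names, and every threshold are hypothetical and do not reflect ScyllaDB's actual tuning logic.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    # Hypothetical node properties; not ScyllaDB's actual metric names.
    cpu_cores: int
    memory_gb: int
    pending_compactions: int


def tune(stats: NodeStats) -> dict:
    """Pick settings from fixed, developer-written rules (illustrative only)."""
    settings = {}
    # Rule 1 (assumed): run one worker per CPU core.
    settings["workers"] = stats.cpu_cores
    # Rule 2 (assumed): dedicate half of the memory to the cache.
    settings["cache_gb"] = stats.memory_gb // 2
    # Rule 3 (assumed): give compaction a larger share when the backlog grows.
    settings["compaction_share"] = 0.5 if stats.pending_compactions > 10 else 0.2
    return settings
```

For example, tune(NodeStats(cpu_cores=8, memory_gb=64, pending_compactions=20)) yields 8 workers, a 32 GB cache, and a compaction share of 0.5. A machine-learning approach would instead learn such mappings from data collected on many clusters, which is the dependency Laor says he wants to avoid.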
A rock and a hard place
On the one hand, the decision to build a new platform that is compatible with an existing one reduces friction and lowers the adoption barrier for organizations. ScyllaDB already has names such as Samsung, IBM, and Outbrain among its early adopters using it in production.
On the other hand, it induces friction with the platform the newcomer aims to displace: Cassandra. We've seen similar examples in the Spark world, but the difference is that Spark alternatives are still largely based on Spark so there can be cross-pollination and eventually perhaps convergence.
Here we're talking about a radical departure -- different implementation language, different low-level infrastructure, different network protocols. There really is no room for Cassandra and ScyllaDB to play side by side, as amply exemplified by the fact they cannot even coexist in a cluster.
Typically, Laor says, people set up a proof-of-concept ScyllaDB cluster working side by side with Cassandra until they feel confident enough to make the switch. "We have different protocols. We considered supporting Cassandra protocols, but there are so many versions out there we decided against it. Plus, when things go wrong in a mixed cluster, whom will you blame?"
Could that hurt adoption? "We are not married to our databases, that's what people tell us," says Laor. "It's a big investment, but they can change. Choosing Cassandra was a strategic decision for us. We started from scratch and rewrote everything. When you do that, you create antagonism. It touches many people, it's sensitive.
"But the results speak for themselves. For example, an AdTech client of ours has managed to go from 100,000 timeouts per second with Cassandra to 100 per second with ScyllaDB. We have not been doing much in terms of collaboration, mostly because at the moment we are heads-down working on feature parity. But like KVM and Xen, where we had common interfaces, there may be potential for collaboration."
Laor mentions some areas in which they are contributing to the Cassandra community, such as ScyllaDB's CTO presenting design choices at a Cassandra next-generation conference, or contributing a driver for Go. He also emphasizes that ScyllaDB is an open source project whose team tries to document and disseminate its design decisions and implementation, and says they would like to work with Cassandra on certain features in the future.
ScyllaDB is a newcomer, but on paper at least it looks like it's got what it takes to displace a heavyweight such as Cassandra with DataStax's enterprise backing. The team has been there and done it before, feature parity is almost there, and financials and organizational structure seem to be there as well.
ScyllaDB is well funded, with a total of $25 million, and has a team of 45 (mostly engineers) working together for years. On the technical front, it seems like ScyllaDB can give Cassandra a run for its money. But what does that "hostile takeover" mean for Cassandra, DataStax and the community? Will ScyllaDB be able to win hearts and minds?
It seems the Cassandra community is currently in something of a turmoil anyway. There has been some friction between DataStax and the Apache Software Foundation, resulting in uncertainty about the project's future and direction. So to be a Cassandra user today may mean you are between a rock and a hard place.
DataStax, for its part, did not reply to a request for comment. ScyllaDB, on the other hand, says its community is growing, despite the fact that the entry barrier is high due to the complex nature of its implementation, and that it has practically achieved feature parity.
ScyllaDB 2.0 is being announced today at Scylla Summit, bringing some highly sought-after features such as counters and materialized views. According to Laor, full feature parity will be achieved in early 2018. Add to the mix the recent acquisition of Seastar.io, which will act as a catalyst for ScyllaDB to offer a managed cloud version, and you see why ScyllaDB is a name you may be hearing more in the future.
Speaking of names, what's with ScyllaDB's name anyway? Apparently its founders wanted to use a name from Greek mythology, as was the case for Cassandra. According to them, in some parts of the world "Scylla" is pronounced "scale-ah," which alludes to scalability, and thus a name was born.
Ironically, Cassandra was an oracle nobody would listen to. Scylla and Charybdis were a monster and a whirlpool guarding the Strait of Messina, making it impossible to navigate past them. To be between Scylla and Charybdis is to be between a rock and a hard place. But to be between ScyllaDB and Cassandra may turn out to be a good thing for the community, should it eventually steer clear of the antagonism.