The Apache Software Foundation wishes a Happy New Year to the Big Data and NoSQL worlds as it today announces the v1.2 release of the open source Cassandra database. While a "dot release" (increment in the version number’s decimal portion) might indicate a small bump in functionality, v1.2 is nonetheless a big release. Major improvements include the addition of virtual nodes, enhancements to the Cassandra Query Language (CQL), request tracing, atomic batches, better handling of disk failures, and various performance improvements, including to memory usage and column indexes.
Important new features, explained
Virtual Nodes (vNodes) optimize and hasten recovery operations when a physical node fails. This is a critical win for a distributed database like Cassandra that is designed to run on commodity hardware, which can be failure-prone. Existing Cassandra clusters can be upgraded to use vNodes, but that is optional.
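To make the recovery benefit concrete, here is a minimal Python sketch of a token ring. It is a toy model under stated assumptions, not Cassandra's implementation: the MD5-based `token` function, the node names, and the figure of 256 tokens per node are all illustrative choices, and replication is ignored. It compares how many surviving nodes inherit a failed node's token ranges when each node holds one token versus many.

```python
import bisect
import hashlib
from collections import Counter


def token(value: str) -> int:
    """Map a string to a ring position (illustrative hash, not Cassandra's partitioner)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def build_ring(nodes, tokens_per_node):
    """Return a sorted list of (token, node) pairs; each node owns several tokens."""
    return sorted(
        (token(f"{node}-{i}"), node)
        for node in nodes
        for i in range(tokens_per_node)
    )


def owner(ring, key):
    """A key belongs to the first token at or after its hash, wrapping around the ring."""
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_left(tokens, token(key)) % len(ring)
    return ring[idx][1]


def takeover(ring, failed):
    """Count how many of the failed node's ranges each surviving node inherits."""
    survivors = [(t, n) for t, n in ring if n != failed]
    tokens = [t for t, _ in survivors]
    inherits = Counter()
    for t, node in ring:
        if node == failed:
            idx = bisect.bisect_left(tokens, t) % len(survivors)
            inherits[survivors[idx][1]] += 1
    return inherits


nodes = [f"node-{c}" for c in "abcdef"]
single = takeover(build_ring(nodes, 1), "node-a")     # one token per node
virtual = takeover(build_ring(nodes, 256), "node-a")  # many tokens per node

print("nodes sharing the load, 1 token each:", len(single))
print("nodes sharing the load, 256 tokens each:", len(virtual))
```

With a single token per node, one neighbor absorbs everything the failed node owned. With many small ranges scattered around the ring, the work is spread across several survivors, which is the intuition behind faster recovery.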
Atomic batches are very important as well. Prior to v1.2, if an operation’s traffic cop-like "coordinator" node failed while a group (batch) of updates was being applied to the database, Cassandra could be left in an inconsistent (partially updated) state. That's something that major relational databases would never allow, and it rendered Cassandra inappropriate for various mission-critical, operational workloads. The atomic batches in v1.2 prevent such inconsistencies by ensuring that groups of updates are treated as indivisible (atomic) units of work: either all the updates succeed or all of them fail. If they all fail, then the batch is reapplied, and there’s no need to determine which individual updates failed or succeeded.
Atomic batches do bring a performance hit, but in the database world, there's no free lunch. That's what the CAP theorem is all about.
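The "reapply the whole batch" idea can be illustrated with a small Python sketch. This is a toy model, not Cassandra's code: the `ToyStore` class, its in-memory `batchlog`, and the `crash_after` switch are hypothetical names invented for illustration. The assumption is that the full batch is recorded before any update is applied, so an interrupted batch can be replayed in its entirety. The extra logging step also hints at where the performance cost comes from.

```python
class CoordinatorCrash(Exception):
    """Simulates the coordinator dying part-way through a batch."""


class ToyStore:
    def __init__(self):
        self.rows = {}       # data that has been applied
        self.batchlog = {}   # batches recorded but not yet confirmed complete
        self._next_id = 0

    def apply_batch(self, updates, crash_after=None):
        """Record the whole batch first, then apply each update in turn."""
        batch_id = self._next_id
        self._next_id += 1
        self.batchlog[batch_id] = list(updates)  # extra write: the cost of atomicity
        for i, (key, value) in enumerate(updates):
            if crash_after is not None and i == crash_after:
                raise CoordinatorCrash(f"crashed after {i} updates")
            self.rows[key] = value
        del self.batchlog[batch_id]  # batch fully applied, log entry no longer needed

    def replay(self):
        """Reapply every logged batch in full; each write is an idempotent upsert."""
        for batch_id in sorted(self.batchlog):
            for key, value in self.batchlog[batch_id]:
                self.rows[key] = value
        self.batchlog.clear()


store = ToyStore()
updates = [
    ("user:1:email", "a@example.com"),
    ("email:a@example.com", "user:1"),
]

try:
    store.apply_batch(updates, crash_after=1)  # fail after the first update
except CoordinatorCrash:
    pass

partial = dict(store.rows)  # snapshot of the inconsistent, half-applied state
store.replay()              # the logged batch is reapplied as a unit
```

Because each write simply overwrites its key, replaying the entire batch is safe, and nothing has to work out which individual updates had already landed.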
A big release for Big Data?
The worlds of Big Data and NoSQL are often conflated. While they’re not the same thing, the NoSQL category known as Wide Column Stores (or, alternatively, Column Family Stores) does tie into Big Data rather tightly. One Wide Column Store is HBase, which uses Hadoop’s Distributed File System (HDFS) and is included in most Hadoop distributions. Cassandra is another Wide Column Store, and while it doesn't use HDFS, it can integrate with Hadoop’s MapReduce processing engine. As such, its importance in the Big Data realm is significant.
Cassandra is also important to cloud computing overall, since Cassandra clusters can be implemented as cloud databases. DataStax Enterprise, a premium distribution of Cassandra, is an explicitly supported product on HP’s cloud platform. DataStax also offers an Amazon Machine Image (AMI) to run the product on Amazon Web Services' Elastic Compute Cloud (EC2).
Not just for startups
The enterprise-ation of NoSQL and Big Data continues unabated. In the case of Cassandra v1.2, manageability and database consistency are being addressed in a very studious fashion. Such a focus on reliability and atomic operations indicates a noteworthy maturity in the NoSQL market.