The Apache Software Foundation recently announced Apache Cassandra Release 0.6, a NoSQL database. As a reformed database architect, I was intrigued by the appearance of yet another data management model.
Here's what the Apache Software Foundation has to say about Cassandra:
The Apache Software Foundation (ASF) --developers, stewards, and incubators of 138 Open Source projects-- today announced Apache Cassandra version 0.6, the Project's latest release and the first since its graduation from the ASF Incubator in February 2010.
Apache Cassandra is an advanced, second-generation "NoSQL" distributed data store that has a shared-nothing architecture. The Cassandra decentralized model provides massive scalability, and is highly available with no single point of failure even under the worst scenarios.
Originally developed at Facebook and submitted to the ASF Incubator in 2009, the Project has added more than a half-dozen new committers, and is deployed by dozens of high-profile users such as Cisco WebEx, Cloudkick, Digg, Facebook, Rackspace, Reddit, and Twitter, among others.
Cassandra 0.6 features include:
- Support for Apache Hadoop: this allows running analytics queries with the leading map/reduce framework against data in Cassandra.
- Integrated row cache: this eliminates the need for a separate caching layer, thereby simplifying architectures.
- Increased speed: this builds on Cassandra's highly-lauded ability to process thousands of writes per second, allowing solutions of all kinds to cope with increasing write loads.
Over my years in the IT industry, I've worked with a number of data management architectures. Each generation of data management added some useful things that made life quite a bit easier for developers. Recently, I've run across a newcomer to the list - the NoSQL data management approach.
Here are some of the data management tools that you may find in your datacenter.
- "Navigational database" -- a user-created data management approach that combined several application-specific index files with a direct access data file. Applications would look up the relevant key in an index file, get the matching record number, and then read that record from the direct access file. This approach was lightweight, fast, easy to build and a bear to maintain. It was used in early minicomputer and mainframe applications. If an index file and the data file got out of synchronization, it was necessary to build or use a tool that scanned the entire data file and rebuilt the broken index file. Some of these applications are still mainstays in organizations' IT infrastructure.
- Pick and MUMPS (now M) databases -- loosely structured data files that contained only the records and fields that applications actually stored. This approach was in use on minicomputers long before the relational model was implemented in products. It is still in use today. Access to the data was provided by a sophisticated multi-way tree-structured index. If the database was structured properly by the architect, finding and updating a record was blindingly fast. This approach also simplified applications that needed to support an ever-changing list of fields. If a new field was needed, the developer simply started storing it in records as that data became available. Older programs worked unchanged. The database didn't need to be restructured or reloaded. This approach produced very, very compact data structures. Applications that needed to traverse the entire database, however, took forever to execute.
- Network or chain databases -- records were linked together by a network of pointers, called chains. This approach was used for mainframe and midrange transactional systems before the relational model was implemented. Like the Pick and MUMPS databases, access to any specific record was very fast. Searching through the whole database for something could take forever. It was possible to construct a query that would jump from chain to chain, never finding the requested data. This, of course, would lock up the data management engine. Some really awful practical jokes were implemented this way.
- Relational databases -- records consisting of predefined fields are stored in predefined tables. Tables are related to one another using several simple, but powerful, operations. This approach came to the forefront and became the standard for most applications in the mid-1980s. It is highly structured, which makes it difficult to add new fields or to change the length of fields in a table. By now, just about every developer knows a great deal about using relational databases. Creating optimal databases, however, is as much an art as a science. Highly skilled database architects are needed.
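To make the oldest approach in this list concrete, here is a minimal Python sketch of the "navigational" pattern: an index that maps a key to a record number, plus a direct access file of fixed-length records. The record layout, the customer fields, and the function names are all assumptions invented for illustration; real systems of that era used their own file formats and languages.

```python
import io
import struct

# Hypothetical fixed-length record layout: 8-byte customer id, 24-byte name.
RECORD = struct.Struct("8s24s")


def write_record(data_file, record_number, customer_id, name):
    """Store one record at its slot in the direct access file."""
    data_file.seek(record_number * RECORD.size)
    data_file.write(RECORD.pack(customer_id.encode(), name.encode()))


def read_record(data_file, record_number):
    """Jump straight to a record by number, without scanning the file."""
    data_file.seek(record_number * RECORD.size)
    raw_id, raw_name = RECORD.unpack(data_file.read(RECORD.size))
    return raw_id.rstrip(b"\0").decode(), raw_name.rstrip(b"\0").decode()


def lookup(index, data_file, customer_id):
    """Index lookup first, then a direct read of the data file."""
    return read_record(data_file, index[customer_id])


def rebuild_index(data_file, record_count):
    """Repair tool: scan every record and regenerate the index."""
    index = {}
    for record_number in range(record_count):
        customer_id, _ = read_record(data_file, record_number)
        index[customer_id] = record_number
    return index


# An in-memory stand-in for the data file, with the index kept separately.
data_file = io.BytesIO()
write_record(data_file, 0, "C001", "Ada")
write_record(data_file, 1, "C002", "Grace")
index = {"C001": 0, "C002": 1}
```

Because the index and the data file are maintained separately, nothing stops them from drifting apart. That is why a full-scan tool like the hypothetical `rebuild_index` above had to exist.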
Now we're seeing the emergence of something new, an approach built upon a highly distributed cache in which there are a number of servers, each hosting part of the data for the application. No SQL language interface is exposed to developers. Applications access the data through a middleware layer. The middleware layer knows how to search all of the individual data caches to find and then update records. The application doesn't know where the data is actually stored at execution time. This approach can be highly scalable.
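The middleware idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the node names, and the simple hash-modulo placement rule are all invented here, and this is not Cassandra's client API or its actual partitioning scheme. Each "server" is just an in-memory dictionary.

```python
import hashlib


class DataStoreMiddleware:
    """Hypothetical middleware that hides which node holds which record."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # Each node is modeled as a plain dict standing in for a server.
        self.nodes = {name: {} for name in self.node_names}

    def _node_for(self, key):
        # A stable hash of the key picks the node; the caller never sees it.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % len(self.node_names)
        return self.node_names[position]

    def put(self, key, record):
        self.nodes[self._node_for(key)][key] = record

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)


# The application works only with keys and records, never with node names.
store = DataStoreMiddleware(["node-a", "node-b", "node-c"])
store.put("user:42", {"name": "Ada"})
store.put("user:43", {"name": "Grace"})
```

One weakness of the modulo rule is that adding a server changes where most keys land, so production systems typically use more careful placement schemes. The sketch is meant only to show why the application does not need to know where its data lives.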
As with other areas of technology, large organizations are likely to have all of these approaches implemented in the datacenter. IT managers are finding it more and more difficult to maintain these systems because finding a top MUMPS (now called M) or Pick expert can be rather challenging. So, organizations do their best to never touch these applications.
Will NoSQL database products, such as Cassandra, become the standard approach over time? That's not likely to happen any time soon. Organizations have too much investment in applications based upon other approaches and are unlikely to abandon them now.
Will new applications be built using NoSQL approaches? Organizations needing extreme transactional processing will need to consider this approach and other approaches for distributed cache or "virtual memory" databases, such as those offered by RNA Networks or Gemstone.