Yesterday, Microsoft's Dave Campbell, a Technical Fellow on the SQL Server team, posted to the SQL Server team blog on the subject of in-memory database technology. Mary Jo Foley, our "All About Microsoft" blogger here at ZDNet, provided some analysis on Campbell's thoughts in a post of her own. I read both, and realized there's an important Big Data side to this story.
In a nutshell
In his post, Campbell says in-memory is about to hit a tipping point and, rather than leaving that assertion unsubstantiated, he provides a really helpful explanation as to why.
Campbell points out that there's been an interesting confluence in the database and computing space:
- Huge advances in transistor density (and, thereby, in memory capacity and multi-core ubiquity)
- As-yet untranscended limits in disk seek times (and access latency in general)
This combination of factors is leading -- and in some cases pushing -- the database industry to in-memory technology. Campbell says that keeping things closer to the CPU, and avoiding random fetches from electromechanical hard disk drives, are the priorities now. That means bringing entire databases, or huge chunks of them, into memory, where they can be addressed quickly by processors.
Compression and column stores
Compression is a big part of this and, in the Business Intelligence world, so are column stores. Column stores keep all values for a column (field) next to each other, rather than doing so with all the values in a row (record). In the BI world, this allows for fast aggregation (since all the values you're aggregating are typically right next to each other) and high compression rates.
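To make the row-versus-column distinction concrete, here is a minimal Python sketch. The sales records, the helper names (`to_columns`, `run_length_encode`) and the choice of run-length encoding are all my own assumptions for illustration; this is not how xVelocity or any particular product encodes its data, just one simple way to see why adjacent column values aggregate quickly and compress well.

```python
from itertools import groupby

# Hypothetical sales records, invented for illustration only.
rows = [
    {"region": "East", "product": "A", "amount": 10},
    {"region": "East", "product": "B", "amount": 20},
    {"region": "East", "product": "A", "amount": 5},
    {"region": "West", "product": "A", "amount": 7},
    {"region": "West", "product": "B", "amount": 8},
]


def to_columns(records):
    """Pivot a list of row dicts into one contiguous list per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of repeated adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# Aggregation reads only the one column it needs, not every field of every row.
total = sum(columns["amount"])

# Repeated values sit next to each other in a column, so they compress well.
encoded_region = run_length_encode(columns["region"])

print(total)
print(encoded_region)
```

In the row layout, summing `amount` means walking past `region` and `product` for every record. In the column layout, the sum touches a single list, and the five `region` values shrink to two (value, count) pairs.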
Microsoft's xVelocity technology (branded as "VertiPaq" until quite recently) uses in-memory column store technology. The technology manifested itself a few years ago as the engine behind PowerPivot, a self-service BI add-in for Excel and SharePoint. With the release of SQL Server 2012, this same engine has been implemented inside Microsoft's full SQL Server Analysis Services component, and has been adapted for use as a special columnstore index type in the SQL Server relational database as well.
The Big Data angle
How does this affect Big Data? I can think of a few ways:
- As I've said in a few posts here, Massively Parallel Processing (MPP) data warehouse appliances are Big Data products. A few of them use columnar, in-memory technology. Campbell even said that columnstore indexes will be added to Microsoft's MPP product soon. So MPP has already started to go in-memory.
- Some tools that connect to Hadoop and provide analysis and data visualization services for its data may use in-memory technology as well. Tableau is one example of a product that does this.
- Databases used with Hadoop, like HBase, Cassandra and Hypertable, fall into the "wide column store" category of NoSQL databases. While NoSQL wide column stores and BI column store databases are not identical, their technologies are related. That creates certain in-memory potential for HBase and other wide column stores, as their data is subject to high rates of compression.
Keeping Hadoop in memory
Hadoop's MapReduce approach to query processing, to some extent, combats disk latency through parallel computation. This seems ripe for optimization, though. Making better use of multi-core processing within a node in the Hadoop cluster is one way to optimize. I've examined that in a recent post as well.
- Also Read: Making Hadoop optimizations Pervasive
Perhaps using in-memory technology in place of disk-based processing is another way to optimize Hadoop. Perhaps we could even combine the approaches: Campbell points out in his post that the low latency of in-memory technology allows for better utilization of multi-cores.
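As a rough illustration of what combining the two might look like, here is a toy Python sketch of the MapReduce pattern run over partitions that already sit in memory, with several workers handling the map step. Everything in it is an assumption for illustration: the `partitions` data, the function names and the use of a thread pool. It does not use Hadoop's API, and it says nothing about how Hadoop itself would be changed.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical in-memory partitions, standing in for data blocks that a
# cluster would normally read from disk. Invented for illustration only.
partitions = [
    ["big data", "in memory"],
    ["column store", "big data in memory"],
    ["memory"],
]


def map_partition(lines):
    """Map step: count words in one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


def word_count(parts, workers=3):
    """Run the map step across workers, then fold the partial results."""
    # Threads only show the shape of the computation here; in CPython they
    # will not speed up CPU-bound work. Real gains need processes or nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, parts))
    return reduce(reduce_counts, partials, Counter())


result = word_count(partitions)
print(result)
```

Because each partition is already a list in memory, the map step never waits on a disk seek, which is the property Campbell ties to keeping multiple cores busy.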
Campbell also says in-memory will soon work its way into transactional databases and their workloads. That's neat, and I'm interested in seeing it. But I'm also interested in seeing how in-memory can take on Big Data workloads.
Perhaps the Hadoop Distributed File System (HDFS) could allow in-memory storage to be substituted for disk-based storage. Or maybe specially optimized solid state disks will be built that have performance on par with RAM (Random Access Memory). Such disks could then be deployed to nodes in a Hadoop cluster.
No matter what, MapReduce, powerful though it is, leaves some low-hanging fruit for the picking. The implementation of in-memory technology might be one such piece of fruit. And since Microsoft has embraced Hadoop, maybe it will take a run at making it happen.
For an approach to Big Data that does use in-memory technology but does not use Hadoop, check out JustOneDB. I haven't done much due diligence on them, but I've talked to their CTO, Duncan Pauly, about the product. He and the company seem very smart and have some fairly groundbreaking ideas about how databases work today and how they need to change.