In-memory databases at Microsoft and elsewhere
Summary: A Technical Fellow at Microsoft says we're headed for an in-memory database tipping point. What does this mean for Big Data?
Yesterday, Microsoft's Dave Campbell, a Technical Fellow on the SQL Server team, posted to the SQL Server team blog on the subject of in-memory database technology. Mary Jo Foley, our "All About Microsoft" blogger here at ZDNet, provided some analysis on Campbell's thoughts in a post of her own. I read both, and realized there's an important Big Data side to this story.
In a nutshell
In his post, Campbell says in-memory is about to hit a tipping point and, rather than leaving that assertion unsubstantiated, he provides a really helpful explanation as to why.
Campbell points out that there's been an interesting confluence in the database and computing space:
- Huge advances in transistor density (and, thereby, in memory capacity and multi-core ubiquity)
- As-yet untranscended limits in disk seek times (and access latency in general)
This combination of factors is leading -- and in some cases pushing -- the database industry to in-memory technology. Campbell says that keeping things closer to the CPU, and avoiding random fetches from electromechanical hard disk drives, are the priorities now. That means bringing entire databases, or huge chunks of them, into memory, where they can be addressed quickly by processors.
Compression and column stores
Compression is a big part of this and, in the Business Intelligence world, so are column stores. Column stores keep all values for a column (field) next to each other, rather than doing so with all the values in a row (record). In the BI world, this allows for fast aggregation (since all the values you're aggregating are typically right next to each other) and high compression rates.
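To make the row-versus-column distinction concrete, here is a minimal Python sketch. It is not how xVelocity or any particular product stores data; the sample records, the `to_columns` helper and the run-length encoder are all hypothetical, chosen only to show why adjacent column values aggregate quickly and compress well.

```python
from itertools import groupby

# Hypothetical sample data: four records (rows) with the same fields.
rows = [
    {"region": "East", "year": 2012, "sales": 100},
    {"region": "East", "year": 2012, "sales": 150},
    {"region": "West", "year": 2012, "sales": 200},
    {"region": "West", "year": 2012, "sales": 50},
]

def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [r[name] for r in records] for name in records[0]}

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

columns = to_columns(rows)

# Aggregation reads one contiguous list instead of visiting every record.
total_sales = sum(columns["sales"])

# Repetitive columns compress well because equal values sit side by side.
encoded_region = run_length_encode(columns["region"])
encoded_year = run_length_encode(columns["year"])
```

Run-length encoding is only one of several compression schemes a column store might use; it is shown here because it is the simplest to read.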
Microsoft's xVelocity technology (branded as "VertiPaq" until quite recently) uses in-memory column store technology. The technology manifested itself a few years ago as the engine behind PowerPivot, a self-service BI add-in for Excel and SharePoint. With the release of SQL Server 2012, this same engine has been implemented inside Microsoft's full SQL Server Analysis Services component, and has been adapted for use as a special columnstore index type in the SQL Server relational database as well.
The Big Data angle
How does this affect Big Data? I can think of a few ways:
- As I've said in a few posts here, Massively Parallel Processing (MPP) data warehouse appliances are Big Data products. A few of them use columnar, in-memory technology. Campbell even said that columnstore indexes will be added to Microsoft's MPP product soon. So MPP has already started to go in-memory.
- Some tools that connect to Hadoop and provide analysis and data visualization services for its data may use in-memory technology as well. Tableau is one example of a product that does this.
- Databases used with Hadoop, like HBase, Cassandra and HyperTable, fall into the "wide column store" category of NoSQL databases. While NoSQL wide column stores and BI column store databases are not identical, their technologies are related. That creates certain in-memory potential for HBase and other wide column stores, as their data is subject to high rates of compression.
Keeping Hadoop in memory
Hadoop's MapReduce approach to query processing, to some extent, combats disk latency through parallel computation. Even so, it seems ripe for optimization. Making better use of multi-core processing within a node in the Hadoop cluster is one way to optimize. I've examined that in a recent post as well.
- Also Read: Making Hadoop optimizations Pervasive
Perhaps using in-memory technology in place of disk-based processing is another way to optimize Hadoop. Perhaps we could even combine the approaches: Campbell points out in his post that the low latency of in-memory technology allows for better utilization of multi-cores.
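As a rough illustration of that combination, the sketch below holds the data in memory, splits it into chunks, runs a map step on each chunk concurrently and then reduces the partial results. It is plain Python, not Hadoop code, and every name in it (`partition`, `map_chunk`, `combine`, `parallel_mean`) is hypothetical. A thread pool stands in for the cores here; in CPython, a process pool would be needed to get true multi-core parallelism for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition(values, parts):
    """Split a list into roughly equal in-memory chunks."""
    size = max(1, -(-len(values) // parts))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]

def map_chunk(chunk):
    """Map step: compute a partial (sum, count) for one chunk."""
    return (sum(chunk), len(chunk))

def combine(left, right):
    """Reduce step: merge two partial (sum, count) results."""
    return (left[0] + right[0], left[1] + right[1])

def parallel_mean(values, workers=4):
    """Average a list by mapping over chunks concurrently, then reducing."""
    chunks = partition(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total, count = reduce(combine, partials, (0, 0))
    return total / count if count else 0.0
```

Because each chunk already sits in memory, no worker waits on a disk seek before it can start its map step, which is the point Campbell makes about low latency and multi-core utilization.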
Campbell also says in-memory will soon work its way into transactional databases and their workloads. That's neat, and I'm interested in seeing it. But I'm also interested in seeing how in-memory can take on Big Data workloads.
Perhaps the Hadoop Distributed File System (HDFS) might allow in-memory storage to be substituted for disk-based storage. Or maybe specially optimized solid state disks will be built that have performance on par with RAM (Random Access Memory). Such disks could then be deployed to nodes in a Hadoop cluster.
No matter what, MapReduce, powerful though it is, leaves some low-hanging fruit for the picking. The implementation of in-memory technology might be one such piece of fruit. And since Microsoft has embraced Hadoop, maybe it will take a run at making it happen.
Addendum
For an approach to Big Data that does use in-memory technology but does not use Hadoop, check out JustOneDB. I haven't done much due diligence on them, but I've talked to their CTO, Duncan Pauly, about the product. He and the company seem very smart and have some fairly breakthrough ideas about databases today and how they need to change.

Talkback
It's funny how in-memory is news
Disk prices are probably a factor
Massively Parallel Processing? Isn't this more about, instantly available
Massively Parallel Processing has existed for decades now, and the new wisdom is about instantly available data, which is resident and available to the "massively parallel processors" for the many different tasks that the data can be used for, by as many different users as the system can handle.
However, eventually, a lot of that data cannot be kept "resident" and it's going to have to be "offloaded" to slower mechanisms, such as hard drives. The hard drives can be used to store relevant data, but more on the historical side, like "older articles" or "older orders", etc. Either way, access to the data should be transparent, no matter what mechanism is used to hold that data. So, a user looking to access a list of orders that could encompass several years of data should not be concerned about where the data is stored, or even notice a slowdown or delay in accessing that data.
JustOneDB
I just want to clarify that JustOneDB is not an in-memory database per se, as it is designed to amortize read/write latency times regardless of where the data resides relative to the CPU (in memory, SSD or HDD). This is especially true for slow, high-capacity hard disks, which are still the natural home for multi-terabyte big data.
You Forgot Someone...
The new Supercomputer is the database
The idea of data blocks in current designs doesn't fit (well, it's more complicated than that). So the MS solution won't be long-lasting. The current in-memory technologies fit specialized needs or can't handle the data load of Big Data or huge simulations, where we will see demands for fast loading times for short-lived data (in RAM) and huge demands on I/O, network, disks and CPU. CPU I/O means the hardware that is directly running the database and other CPUs, for example those running search engines or computational models, or even other databases.
In my opinion, a new age in supercomputing will begin, and maybe finally, a new generation of AI.
The technical aspect and possibilities
In-memory databases are very cost-efficient
Of course when having the main database storage in RAM you need to store image files and transaction logs on some kind of non-volatile memory to guarantee full database recovery.
Peter Idestam-Almquist, CTO of Starcounter
I'm lost as to where people are getting their numbers.
I don't know the price of bulk RAM or HDs, so it's entirely possible that when you get up to the petabyte scale, they're the same, but I can start with retail pricing and I'm betting it scales about the same at the high-end bulk level.
2GB of RAM is $30. 1 Petabyte of RAM = $15,000,000. You'll need 500,000 sticks of RAM. It's tricky to work out the current requirements (it varies wildly depending on what you're doing) but it ranges 160mA to 4500mA... so let's be extremely generous and say 160mA * 500,000 = 80,000A @ 1.2v = 96,000W.
1TB HD is around $50. 1 Petabyte of HD = $50,000. If you use 4TB drives, you'll need 250 drives. The WD 4TB Caviar Green drive consumes 10.4W/drive = 2,600W.
That leaves speed. If you place each drive on its own controller, and distribute your DB across the drives with redundancy, while you won't get the speed of RAM, you will get a substantial improvement in speed - without sacrificing data reliability. And since the price is so much lower, you can afford to have triple redundancy without even getting close to the price of RAM and with less power requirements.
There are definitely cases where in-memory makes more sense - but these are exceptional cases. It doesn't make sense to generalize them to all cases without considering the real world costs of it.
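The cost and power arithmetic in the comment above can be reproduced with a short Python sketch. The prices, capacities, current draw and wattage are the commenter's own retail figures, not verified numbers, and the function names are hypothetical. Decimal units are assumed (1 PB = 1,000 TB = 1,000,000 GB), since that is what the comment's totals imply.

```python
GB_PER_PB = 1_000_000  # decimal units, matching the comment's arithmetic
TB_PER_PB = 1_000

def ram_estimate(stick_gb=2, stick_price=30, amps_per_stick=0.160, volts=1.2):
    """Cost and power for 1 PB of RAM, using the commenter's figures."""
    sticks = GB_PER_PB // stick_gb
    return {
        "units": sticks,
        "cost": sticks * stick_price,
        "watts": round(sticks * amps_per_stick * volts),
    }

def disk_estimate(price_per_tb=50, drive_tb=4, watts_per_drive=10.4):
    """Cost and power for 1 PB of hard disk, using the commenter's figures."""
    drives = TB_PER_PB // drive_tb
    return {
        "units": drives,
        "cost": TB_PER_PB * price_per_tb,
        "watts": round(drives * watts_per_drive),
    }

ram = ram_estimate()
disk = disk_estimate()
cost_ratio = ram["cost"] / disk["cost"]
power_ratio = ram["watts"] / disk["watts"]
```

On these inputs, RAM comes out at 300 times the cost and roughly 37 times the power draw of disk for the same raw capacity, before any redundancy is added on the disk side.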