In-memory databases at Microsoft and elsewhere

Summary: A Technical Fellow at Microsoft says we're headed for an in-memory database tipping point. What does this mean for Big Data?


Yesterday, Microsoft's Dave Campbell, a Technical Fellow on the SQL Server team, posted to the SQL Server team blog on the subject of in-memory database technology.  Mary Jo Foley, our "All About Microsoft" blogger here at ZDNet, provided some analysis on Campbell's thoughts in a post of her own.  I read both, and realized there's an important Big Data side to this story.

 

In a nutshell

In his post, Campbell says in-memory is about to hit a tipping point and, rather than leaving that assertion unsubstantiated, he provided a really helpful explanation as to why. 

Campbell points out that there's been an interesting confluence in the database and computing space:

  • Huge advances in transistor density (and, thereby, in memory capacity and multi-core ubiquity)
  • As-yet untranscended limits in disk seek times (and access latency in general)

This combination of factors is leading -- and in some cases pushing -- the database industry to in-memory technology.  Campbell says that keeping things closer to the CPU, and avoiding random fetches from electromechanical hard disk drives, are the priorities now.  That means bringing entire databases, or huge chunks of them, into memory, where they can be addressed quickly by processors.
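To get a feel for the gap Campbell is describing, here's a small, hypothetical Python sketch (mine, not anything from his post) that times random lookups against an in-memory structure versus random seeks into a file. The absolute numbers depend on your hardware and the OS page cache, but memory wins by a wide margin, and against a cold electromechanical disk the gap is wider still.

```python
import os
import random
import time

N = 1_000_000   # number of 8-byte records
RECORD = 8

# Build the same data set twice: once in memory, once on disk.
in_memory = {i: os.urandom(RECORD) for i in range(N)}

with open("records.bin", "wb") as f:
    f.write(os.urandom(N * RECORD))

keys = [random.randrange(N) for _ in range(10_000)]

# Random lookups against RAM.
start = time.perf_counter()
for k in keys:
    _ = in_memory[k]
ram_secs = time.perf_counter() - start

# Random seeks against the file (the OS page cache flatters this;
# on a cold cache, a spinning disk would fall much further behind).
start = time.perf_counter()
with open("records.bin", "rb") as f:
    for k in keys:
        f.seek(k * RECORD)
        _ = f.read(RECORD)
disk_secs = time.perf_counter() - start

print(f"RAM lookups:  {ram_secs:.4f}s")
print(f"Disk lookups: {disk_secs:.4f}s")
os.remove("records.bin")
```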

 

Compression and column stores

Compression is a big part of this and, in the Business Intelligence world, so are column stores.  Column stores keep all the values for a column (field) next to each other, rather than keeping all the values in a row (record) together, as a conventional row store does.  In the BI world, this allows for fast aggregation (since all the values you're aggregating are typically right next to each other) and high compression rates.
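As a rough illustration of that layout difference (my own toy sketch, not how xVelocity actually stores data), here's the same tiny data set held row-wise and column-wise in Python, with a one-column aggregation and a simple run-length encoding of a repetitive column:

```python
from itertools import groupby

# Row store: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "East", "amount": 120.0},
    {"order_id": 2, "region": "East", "amount": 75.5},
    {"order_id": 3, "region": "West", "amount": 200.0},
]

# Column store: each field's values are stored contiguously.
columns = {
    "order_id": [1, 2, 3],
    "region":   ["East", "East", "West"],
    "amount":   [120.0, 75.5, 200.0],
}

# Aggregation touches only the one column it needs.
total = sum(columns["amount"])

# Low-cardinality columns compress well; run-length encoding is the
# simplest example of the kind of encoding column stores exploit.
rle = [(value, len(list(run))) for value, run in groupby(columns["region"])]

print(total)   # 395.5
print(rle)     # [('East', 2), ('West', 1)]
```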

Microsoft's xVelocity technology (branded as "VertiPaq" until quite recently) uses in-memory column store technology. The technology manifested itself a few years ago as the engine behind PowerPivot, a self-service BI add-in for Excel and SharePoint.  With the release of SQL Server 2012, this same engine has been implemented inside Microsoft's full SQL Server Analysis Services component, and has been adapted for use as a special columnstore index type in the SQL Server relational database as well.
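For the relational side, creating one of those columnstore indexes is a single DDL statement in SQL Server 2012. The sketch below assumes the pyodbc driver and uses made-up table and column names; it simply issues that statement from Python.

```python
import pyodbc

# Connection details are placeholders; adjust for your environment.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=localhost;DATABASE=SalesDW;Trusted_Connection=yes"
)
cursor = conn.cursor()

# SQL Server 2012 columnstore indexes are nonclustered and, in that
# release, make the underlying table read-only until the index is
# dropped or disabled.
cursor.execute("""
    CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactSales
    ON dbo.FactSales (OrderDateKey, RegionKey, SalesAmount)
""")
conn.commit()
conn.close()
```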

 

The Big Data Angle

How does this affect Big Data?  I can think of a few ways:

  1. As I've said in a few posts here, Massively Parallel Processing (MPP) data warehouse appliances are Big Data products.  A few of them use columnar, in-memory technology.  Campbell even said that columnstore indexes will be added to Microsoft's MPP product soon.  So MPP has already started to go in-memory.

  2. Some tools that connect to Hadoop and provide analysis and data visualization services for its data may use in-memory technology as well.  Tableau is one example of a product that does this.
  3. Databases used with Hadoop, like HBase, Cassandra and HyperTable, fall into the "wide column store" category of NoSQL databases.  While NoSQL wide column stores and BI column store databases are not identical, their technologies are related.  That creates real in-memory potential for HBase and other wide column stores, since their data is subject to high rates of compression.  (A toy sketch of the wide column data model follows this list.)
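Here is the toy sketch promised above: a drastically simplified Python model of the wide column layout (row key, column family, column qualifier). This is not HBase's or Cassandra's actual API; it's just enough to show how rows carry arbitrary, sparse sets of columns grouped into families.

```python
from collections import defaultdict

# A toy "wide column store" table: data is addressed by
# (row key, "family:qualifier"), and each row can carry an
# arbitrary, sparse set of columns.
table = defaultdict(dict)

table["user:1001"]["profile:name"] = "Ada"
table["user:1001"]["profile:city"] = "London"
table["user:1001"]["clicks:2012-04-09"] = "17"
table["user:1002"]["profile:name"] = "Grace"   # no clicks columns at all

# Reading a single column family for one row key:
profile = {
    qualifier: value
    for qualifier, value in table["user:1001"].items()
    if qualifier.startswith("profile:")
}
print(profile)  # {'profile:name': 'Ada', 'profile:city': 'London'}
```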

 

Keeping Hadoop in memory

Hadoop's MapReduce approach to query processing, to some extent, combats disk latency through parallel computation.  Even so, it seems ripe for optimization.  Making better use of multi-core processing within a node in the Hadoop cluster is one way to optimize.  I've examined that in a recent post as well.
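To illustrate the multi-core point in the abstract (this is a generic map/reduce toy in Python, not Hadoop's actual task runner), the map phase below fans out across local cores with multiprocessing, and a single reduce step merges the partial results:

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_phase(chunk_of_lines):
    """Count words in one chunk of input (the 'map' side)."""
    counts = Counter()
    for line in chunk_of_lines:
        counts.update(line.split())
    return counts

def reduce_phase(partials):
    """Merge partial counts from every worker (the 'reduce' side)."""
    return reduce(lambda a, b: a + b, partials, Counter())

if __name__ == "__main__":
    lines = ["big data big memory", "memory is fast", "big disks are slow"] * 1000
    # Split the input into four chunks and map them on four local cores.
    chunks = [lines[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_counts = pool.map(map_phase, chunks)
    totals = reduce_phase(partial_counts)
    print(totals.most_common(3))
```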

Perhaps using in-memory technology in place of disk-based processing is another way to optimize Hadoop.  Perhaps we could even combine the approaches: Campbell points out in his post that the low latency of in-memory technology allows for better utilization of multi-core processors.

Campbell also says in-memory will soon work its way into transactional databases and their workloads.  That's neat, and I'm interested in seeing it.  But I'm also interested in seeing how in-memory can take on Big Data workloads. 

Perhaps the Hadoop Distributed File System (HDFS) might allow in-memory storage to be substituted for disk-based storage.  Or maybe specially optimized solid state disks will be built whose performance is on par with RAM (Random Access Memory).  Such disks could then be deployed to nodes in a Hadoop cluster.

No matter what, MapReduce, powerful though it is, leaves some low-hanging fruit for the picking.  The implementation of in-memory technology might be one such piece of fruit.  And since Microsoft has embraced Hadoop, maybe it will take a run at making it happen.

 

Addendum

For an approach to Big Data that does use in-memory technology but does not use Hadoop, check out JustOneDB.  I haven't done much due diligence on them, but I've talked to their CTO, Duncan Pauly, about the product.  He and the company seem very smart, and they have some genuinely novel ideas about where databases stand today and how they need to change.

Topics: Hardware, Data Centers, Data Management, Enterprise Software, Microsoft, Software, Storage

Andrew Brust

About Andrew Brust

Andrew J. Brust has worked in the software industry for 25 years as a developer, consultant, entrepreneur and CTO, specializing in application development, databases and business intelligence technology.


Talkback

8 comments
  • It's funny how in-memory is news

    This is computers 101. RAM is faster than disk. Here's the next news story: RAM is volatile, but MRAM will save us all...
    happyharry_z
  • Disk prices are probably a factor

    Sure, RAM is still several times the price of HD per GB. A quick layman's estimate puts it at about 100 times more expensive using consumer hardware (a 2TB HD and 20GB of RAM are in the same ballpark). Anyone know how the price would scale when using ECC memory and 15kRPM HDs?
    Li1t
  • Massively Parallel Processing? Isn't this more about instantly available data?

    Massively Parallel Processing has existed for decades now, and the new wisdom is about instantly available data, which is resident and available to the "massively parallel processors" for the many different tasks that the data can be used for, by as many different users as the system can handle.

    However, eventually, a lot of that data cannot be kept "resident" and it's going to have to be "offloaded" into slower mechanisms, such as hard drives. The hard drives can be used to store relevant data, but more on the historical side, like "older articles" or "older orders", etc. Either way, access to the data should be transparent, no matter what mechanism is used to hold it. So, a user looking to access a list of orders that could encompass several years of data should not be concerned about where the data is stored, or even notice a slowdown/delay in accessing that data.
    adornoe
  • JustOneDB

    Thanks for the mention Andrew.

    I just want to clarify that JustOneDB is not an in-memory database per se, as it is designed to amortize read/write latency times regardless of where the data resides relative to the CPU (in memory, SSD or HDD). This is especially true for slow, high-capacity hard disks, which are still the natural home for multi-terabyte big data.
    DuncanPauly
  • You Forgot Someone...

    You forgot one of the first in-memory database products and vendors, which has been on the market for over a year: SAP's HANA. Many of the other vendors, such as Microsoft and Oracle, are now playing catch-up.
    smtp4me@...
  • The new Supercomputer is the database

    I have been interested in in-RAM databases for over a decade. I don't believe that the current database models from MS or Oracle will suffice.
    The idea of data blocks in current designs doesn't fit (well, it's more complicated than that). So the MS solution won't be long-lasting. The current in-memory technologies fit specialized needs, or can't handle the data load of Big Data or huge simulations, where we will see demands for fast loading times for short-lived data (in RAM) and huge demands on I/O, network, disks and CPU. By CPU I/O I mean the hardware that is directly running the database and other CPUs, for example those running search engines, computational models, or even other databases.
    In my opinion, a new age in supercomputing will begin, and maybe, finally, a new generation of AI.

    The technical aspect and possibilities
    raggi
  • In-memory databases are very cost-efficient

    The cost of storing entire databases in RAM is rather low, but the advantages are enormous. By storing the entire database in RAM you are able to process millions of database transactions per second and serve millions of simultaneous users on a single database node on a standard server. Consequently, a single in-memory database node can replace up to a hundred disk-centric database nodes, and will therefore be very cost efficient.
    Of course, when the main database storage is in RAM, you need to store image files and transaction logs on some kind of non-volatile memory to guarantee full database recovery.

    Peter Idestam-Almquist, CTO of Starcounter
    peteria
  • I'm lost as to where people are getting their numbers.

    The price of RAM is low compared to RAM from, say, two years ago. But the price of RAM compared to hard drives isn't low at all.

    I don't know the price of bulk RAM or HDs, so it's entirely possible that when you get up to the petabyte scale they're the same, but I can start with retail pricing, and I'm betting it scales about the same at the high-end bulk level.

    2GB of RAM is $30. 1 Petabyte of RAM = $15,000,000. You'll need 500,000 sticks of RAM. It's tricky to work out the current requirements (it varies wildly depending on what you're doing) but it ranges from 160mA to 4,500mA... so let's be extremely generous and say 160mA * 500,000 = 80,000A @ 1.2V = 96,000W.

    1TB HD is around $50. 1 Petabyte of HD = $50,000. If you use 4TB drives, you'll need 250 drives. The WD 4TB Caviar Green drive consumes 10.4W/drive = 2,600W.

    That leaves speed. If you place each drive on its own controller, and distribute your DB across the drives with redundancy, while you won't get the speed of RAM, you will get a substantial improvement in speed - without sacrificing data reliability. And since the price is so much lower, you can afford to have triple redundancy without even getting close to the price of RAM and with less power requirements.

    There are definitely cases where in-memory makes more sense - but these are exceptional cases. It doesn't make sense to generalize them to all cases without considering the real world costs of it.
    TheWerewolf