Is the sky the limit for Flash and In-Memory Databases?

Summary: In-memory databases are hot, but tiered storage may be our cooler destination.

This guest post comes courtesy of Tony Baer’s OnStrategies blog. Baer is a principal analyst covering Big Data at Ovum.

Big Data is getting bigger, and Fast Data is getting faster, thanks to the continually declining cost of all things infrastructure. Ongoing commoditization of powerful multi-core CPUs, storage media, and connectivity made scale-out Internet data centers possible, and with them, scale-out data platforms such as Hadoop and the new generation of Advanced SQL/NewSQL analytic data stores.

Bandwidth is similarly going crazy; while caps on 4G plans may make bandwidth seem elusive to mobile users, growth in the bandwidth connecting devices and things is now taken for granted.

Conventional wisdom is that similar trends are impacting storage, and until recently, that was the Kool-Aid that we swallowed. For sure, the macro picture is that declining price and ascending density curves are changing the conversation when it comes to deploying data.

The type of media on which you store data is no longer just a price/performance tradeoff; it is increasingly an architectural consideration in how data is processed and how applications that run on that data are engineered. Bigger, cheaper storage makes bigger analytics possible; faster, cheaper storage makes more complex and functional applications possible.

At 100,000 feet, such trends for storage are holding, but dig beneath the surface and the picture gets more nuanced. And those nuances are increasingly driving how we design our data-driven transaction applications and analytics.

Cut through the terminology

But before we dive into the trends, let's get our terminology straight, because the term memory is used much too loosely (does it mean DRAM or Flash?). For this discussion, we'll stick with the following conventions, with a rough sketch of ballpark numbers after the list:

  • CPU cache is the on-chip memory used to temporarily hold the data the processor is actively working on.

  • DRAM memory is the fastest storage layer that sits outside the chip, and is typically parceled out in GBytes per compute core.

  • Solid State Drive (SSD) technology is based on Flash memory and is the silicon-based, faster substitute for traditional hard drives. SSDs are typically sized at hundreds of GBytes (with some units just under a terabyte) but are not as fast as DRAM.

  • Hard disk, or "disk," is the workhorse that now scales economically up to 1-3 TBytes per spindle.
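To make those tiers concrete, here is a rough sketch in Python with ballpark latency and capacity figures. The numbers are illustrative approximations for commodity hardware of this era, not vendor specs.

    # Rough sketch of the storage hierarchy above; figures are ballpark
    # approximations and vary widely by hardware generation and vendor.
    STORAGE_TIERS = [
        # (tier,        typical access latency, typical capacity)
        ("CPU cache",   "1-10 ns",   "KBytes-MBytes per core"),
        ("DRAM",        "~100 ns",   "GBytes per core"),
        ("SSD (Flash)", "25-100 us", "hundreds of GBytes per drive"),
        ("Hard disk",   "5-10 ms",   "1-3 TBytes per spindle"),
    ]

    for tier, latency, capacity in STORAGE_TIERS:
        print(f"{tier:<12} latency {latency:<10} capacity {capacity}")

Each step down the list trades roughly an order of magnitude (or more) of latency for an order of magnitude of capacity, which is what makes tiering decisions interesting.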

So what’s best for which?

For hard drives, conventional wisdom has been that they keep getting faster and cheaper. Turns out, only the latter is true. The cheapness of 1-3 TByte drives has made scale-out Internet data centers possible, and with it, as we said already, scale-out Big Data analytic platforms like Hadoop. Hard disk continues to be the medium of choice for large volumes of data because individual drives routinely scale to 1-3 TBytes. And momentary supply chain disruptions like the 2011 Thailand floods aside, the supply remains more than adequate. Flash drives simply don’t get as fat.

But if anything, hard drives are getting slower, because it's no longer worthwhile to try speeding them up. With Flash at least 10 to 100 times faster, there's no way disk will catch up even if the technology gets refreshed. Flash is actually pulling the rug out from under demand for 7200 RPM disks (currently the state of the art for disk).
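A back-of-envelope calculation shows where that 10-to-100-times figure comes from: random I/O on a spinning drive is bounded by seek time plus rotational latency, neither of which is improving. The latency figures below are illustrative assumptions, not measurements of any particular drive.

    # Back-of-envelope random-I/O comparison (illustrative assumptions).
    rpm = 7200
    avg_rotational_latency_ms = (60_000 / rpm) / 2  # half a rotation: ~4.2 ms
    avg_seek_ms = 8.5                               # assumed typical 7200 RPM seek
    disk_iops = 1_000 / (avg_rotational_latency_ms + avg_seek_ms)

    ssd_read_latency_us = 100                       # assumed commodity SSD read
    ssd_iops = 1_000_000 / ssd_read_latency_us

    print(f"Disk: ~{disk_iops:.0f} random IOPS")    # ~79
    print(f"SSD:  ~{ssd_iops:.0f} random IOPS")     # ~10,000
    print(f"Ratio: ~{ssd_iops / disk_iops:.0f}x")   # roughly two orders of magnitude

No plausible improvement in spin rate or seek time closes a gap that size.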

Not surprisingly, disk technology development has hit the wall.

Given current price trends, some analysts expect Flash to reach parity with disk in the next 12-18 months (or maybe sooner), leaving less reason for your next transaction system to be disk-based. In fact, there is good reason to be a bit skeptical about how soon the supply of SSD Flash will ramp up adequately for the transaction system market; but SSD Flash will gradually make its way to prime time. Conversely, with disk likely to remain fatter in capacity than Flash, disk will be best suited for active archiving that keeps older data, otherwise bound for tape, live, and for Big Data analytics, where the need is for volume.

Nonetheless, the workhorse of large Hadoop clusters and similar disk-based Big Data analytic or active-archive clusters will likely be the slower 5400 RPM models.

So what about even faster modes of storage? In the past couple of years, DRAM prices crossed the threshold where it became feasible to deploy DRAM as the primary data store, rather than just as a cache for currently used data. That cleared the way for the in-memory database (IMDB), which is often a code word for all-DRAM data storage.

In-memory databases are hardly new, but until the last three to four years they were highly specialized. Oracle TimesTen, one of the earliest commercial offerings, was designed for tightly coupled, specialized transactional applications; other purpose-built in-memory data stores have existed for capital markets for at least a decade. But now DRAM prices have dropped sufficiently to bring IMDBs into the enterprise mainstream.

Kognitio opened the floodgates when it reincarnated its MOLAP-cube and row-store analytic platform as an in-memory platform roughly five years ago; SAP put in-memory in the spotlight with HANA for analytics and transactional applications; and Oracle followed, reincarnating TimesTen as Exalytics for running Oracle Business Intelligence Enterprise Edition (OBIEE) and Essbase.

Yet an interesting blip happened on the way to the "inevitable" all-in-memory database future: last spring, DRAM prices stopped dropping. In part this was attributable to consolidation of the industry around fewer suppliers. But the larger driver was that the wisdom of crowds (namely, that DRAM was now ready for prime time) got ahead of itself. Yes, the laws of supply and demand will eventually shift the trajectory of memory pricing. But no, that won't change the fact of life that, no matter how cheap, DRAM (and cache) will always be premium storage.

In-memory databases are dead, long live tiered databases

The sky is not the limit for DRAM in-memory databases. The rush to in-memory will morph into an expansion of data tiering. And actually that’s not such a bad thing: do you really need to put all of that data in memory? We think not.

IBM and Teradata have shunned all-in-memory architectures; their contention is that the 80/20 rule should govern which data goes into memory. And under their breaths, the all-in-memory database folks have fallbacks for paging data between disk and memory. Designed properly, this is not constant paging, but a process that occurs only for the rare out-of-range query. Kognitio has a clever pricing model: it doesn't charge you for the disk, just for the volume of memory. As for HANA, disk is designed into the system for permanent offline storage, but SAP quietly adds that it can also be used for paging data during routine operation. Maybe SAP shouldn't be so quiet about that.
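To illustrate that paging fallback, here is a minimal two-tier sketch in Python: hot rows live in a bounded in-memory tier, cold rows are paged out to a (here simulated) disk tier, and a disk read happens only for the rare out-of-range lookup. The class and sizing are hypothetical, not any vendor's design.

    from collections import OrderedDict

    class TieredStore:
        """Minimal sketch of DRAM/disk tiering: hot rows stay in a bounded
        in-memory tier; cold rows are paged out to a simulated disk tier
        and paged back in only on the rare out-of-range read."""

        def __init__(self, memory_capacity):
            self.memory_capacity = memory_capacity
            self.memory = OrderedDict()   # hot tier, kept in LRU order
            self.disk = {}                # cold tier (stands in for disk/SSD)

        def put(self, key, row):
            self.memory[key] = row
            self.memory.move_to_end(key)
            while len(self.memory) > self.memory_capacity:
                cold_key, cold_row = self.memory.popitem(last=False)
                self.disk[cold_key] = cold_row      # page out the coldest row

        def get(self, key):
            if key in self.memory:                  # common case: in-memory hit
                self.memory.move_to_end(key)
                return self.memory[key]
            row = self.disk.pop(key)                # rare case: page in from disk
            self.put(key, row)
            return row

    store = TieredStore(memory_capacity=2)
    for k in ("a", "b", "c"):
        store.put(k, f"row-{k}")
    print(store.get("a"))   # "a" was paged out; this read pages it back in

If the hot tier is sized to the working set, the disk path stays off the critical path for nearly all queries, which is exactly the 80/20 argument.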

There's one additional form of tiering to consider for highly complex analytics: the boost that can come from pipelining computations inside in-chip cache. Oracle is looking to similar techniques to further optimize upcoming generations of its Exadata database appliance platform. The technique is also part of IBM's recent BLU architecture for DB2. High-performance analytic platforms such as SiSense likewise use in-chip pipelining to reduce balance-of-system costs (for example, by requiring less DRAM).
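The gist of in-chip pipelining is to process a column in blocks small enough that the working set stays in CPU cache. The sketch below is only a rough illustration of that blocking idea; the real engines named above do this in native, vectorized code, and the chunk size here is a guess, not a tuned value.

    import numpy as np

    # Sketch of cache-blocked columnar processing: scan a column in chunks
    # sized to (roughly) fit in CPU cache so intermediate work stays in-chip.
    column = np.random.rand(10_000_000)
    CHUNK = 64 * 1024 // 8              # ~64 KB of float64 values per block

    total = 0.0
    for start in range(0, column.size, CHUNK):
        block = column[start:start + CHUNK]    # block small enough to stay cached
        total += float(np.sum(block * block))  # pipelined work over the cached block
    print(total)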

It’s all about balance of system

Balance of system is hardly new, but until recently it meant trading off CPU or bandwidth with tiers of disk. Application and database design, in turn, focused on distributing or sharding data to place the most frequently accessed data on the disks, or portions of disk, that could be accessed fastest. New forms of storage, including Flash and DRAM, add a few new elements to the mix. You'll still configure storage (along with processors and interconnects) for the application and vice versa, but you'll have a couple of new toys in your arsenal.

For Flash, it means fast OLTP applications that could add basic analytics, such as what Oracle's recent wave of In-Memory Applications promises. For in-memory, it means OLTP applications with even more complex analytics and/or what-if simulations embedded in line, such as what SAP is promising with its recently introduced Business Suite and CRM applications on HANA.

For in-memory, we'd contend that in most cases, configurations that keep 100 percent of data in DRAM will remain overkill. Unless you are running a Big Data analytic problem that must encompass all of the data, you will likely work with just a fraction of it. Furthermore, IBM, Oracle, and Teradata are incorporating data-skipping features into their analytic platforms that deliberately filter out irrelevant data so it is never scanned. There are many ways to speed processing before reaching for the fastest storage option.
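Data skipping is easy to sketch: keep min/max statistics per block of a column and skip any block whose range cannot match the predicate. The block size and data below are illustrative stand-ins for the statistics those platforms maintain internally.

    import random

    # Minimal zone-map sketch: per-block min/max stats let whole blocks be
    # skipped when they cannot match the query predicate.
    BLOCK_SIZE = 1_000
    data = sorted(random.randint(0, 1_000_000) for _ in range(100_000))
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    zone_map = [(min(b), max(b)) for b in blocks]   # per-block min/max stats

    def count_matches(lo, hi):
        hits, scanned = 0, 0
        for (bmin, bmax), block in zip(zone_map, blocks):
            if bmax < lo or bmin > hi:              # block can't match: skip it
                continue
            scanned += 1
            hits += sum(1 for v in block if lo <= v <= hi)
        return hits, scanned

    hits, scanned = count_matches(900_000, 910_000)
    print(f"{hits} rows matched; scanned {scanned} of {len(blocks)} blocks")

On clustered data like this, a selective predicate touches only a handful of blocks, so most of the data never needs to be in fast storage at all.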

Storage will become an application design option

Although we're leery of hopping on the 100 percent DRAM in-memory bandwagon, when smartly deployed, that model could truly transform applications. When you eliminate the latency, you can embed complex analytics in line with transactional applications, run more complex analytics, or make it feasible for users to run more what-if simulations to support their decisions.

Examples include transaction applications that differentiate how to fulfill orders from gold-, silver-, or bronze-level customers based on levels of service and cost of fulfillment. In-memory could help mitigate risk in operational or fiduciary decisions by allowing more permutations of scenarios to be run. It could also enhance Big Data analytics by tiering the most frequently used data (and logic) in memory.

Whether to use DRAM or Flash will be a function of data volume and problem complexity. No longer will inclusion of storage tiers be simply a hardware platform design decision; it will become a configuration decision for application designers as well.

About Andrew Brust

Andrew J. Brust has worked in the software industry for 25 years as a developer, consultant, entrepreneur and CTO, specializing in application development, databases and business intelligence technology.

Talkback

  • Generally speaking, "memory" is DRAM.

    "But before we dive into the trends, let’s get our terminology straight, because the term memory is used much too loosely (does it mean DRAM or Flash?)."

    Generally speaking, it's DRAM. Flash is considered permanent storage, like a hard drive. Just because some media folks can't tell the difference doesn't mean there's any confusion between people familiar with the field. I'd say that people in the field always mean DRAM when they say "memory" with no other qualifiers.

    "Turns out, only the latter is true."

    Actually, they're both true; larger densities also mean you can read more data in a single rotation, even if the platter doesn't spin any faster.

    That being said, I don't think anybody has ever argued about speed; flash has always been faster. People and businesses buy spindle-based drives because of the capacity and the price, not because of the speed.

    "some analysts expect Flash to reach parity with disk in the next 12 – 18 months (or maybe sooner)"

    Like they predicted 10 years ago?

    "Analysts" are a joke.

    Although I have noticed that since we reached the 1 TB mark, progress has been slower for current spindle drives.

    By the way, in-memory databases are nothing new; the technology has been around since the early days of computing. Heck, my own PC does it: many apps use database-like technology (or even databases themselves, like SQLite) and store the database in memory. The only reason large-scale data centers aren't doing the same comes down to the cost of DRAM, pure and simple. The technology has been there for 20+ years; it's just been too expensive.
    CobraA1
  • A walk down in-memory lane...

    Tony:

    Especially like your points on “balance of system” and a tiered approach to data processing. It should also be noted that SSD, while faster than conventional disk, still carries overhead that DRAM processing avoids.

    Clearing up some facts: Kognitio is, and has always been, an in-memory database. From its first iteration in 1989, it has been optimized for analytical workloads; ergo, an “in-memory analytical platform.” We share the opinion of IBM and Teradata: best practice dictates that not all data needs to be in memory; a tiered approach, with hot data “pinned” in memory, is an optimal way to serve up near-real-time analytics on large and complex data sets. Differing from those platforms, Kognitio offers a best-of-breed approach with a robust version 8 in-memory offering that avoids lock-in to those large vendors.

    While the combination of native MDX capability and MPP in-memory gives us the ability to do “virtual cubes”, which might replace a MOLAP tool, Kognitio is more than that – it enables the “information anywhere” approach that is favored by Gartner in their “Logical Data Warehouse” Model (as well as others like Radiant Advisors’ “Modern Data Platform” and Enterprise Management Associates’ “Hybrid Data Ecosystem”).

    Enterprise Architects looking to optimize their approach to information management will look to employ an in-memory layer that can be generally accessible for their SQL and NoSQL requirements via their incumbent BI and business applications, but data 100% in-memory is still a long way off.
    Michael Hiskey