This guest post comes courtesy of Tony Baer’s OnStrategies blog. Baer is a principal analyst covering Big Data at Ovum.
Big Data is getting bigger, and Fast Data is getting faster because of the continuing declining cost of all things infrastructure. Ongoing commoditization of powerful, multi-core CPUs, storage media, and connectivity made scale-out Internet data centers possible, and with them, scale-out data platforms such as Hadoop and the new generation of Advanced SQL/NewSQL analytic data stores.
Bandwidth is similarly going crazy; while caps on 4G plans may make bandwidth seem elusive to mobile users, growth of bandwidth for connecting devices and things has become a fact taken for granted.
Conventional wisdom is that similar trends are impacting storage, and until recently, that was the Kool-Aid that we swallowed. For sure, the macro picture is that declining price and ascending density curves are changing the conversation where it comes to deploying data.
The type of media on which you store data is no longer just a price/performance tradeoff, but increasingly an architectural consideration on how data is processed and applications that run on data are engineered. Bigger, cheaper storage makes bigger analytics possible; faster, cheaper storage makes more complex and functional applications possible.
At 100,000 feet, such trends for storage are holding, but dig beneath the surface and the picture gets more nuanced. And those nuances are increasingly driving how we design our data-driven transaction applications and analytics.
Cut through the terminology
But before we dive into the trends, let’s get our terminology straight, because the term memory is used much too loosely (does it mean DRAM or Flash?). For this discussion, we’ll stick with the following conventions:
CPU cache is the memory in-chip that is used for temporarily holding data being processed by the processor.
DRAM memory is the fastest storage layer that sits outside the chip, and is typically parceled out in GBytes per compute core.
Solid State Drive (SSD) technology is based on Flash memory, and is the silicon-based, faster substitute to traditional hard drives. SSDs are typically sized at hundreds of GBytes (with some units just under a terabyte) but are not as fast as DRAM.
Hard disk, or "disk," is the workhorse that now scales economically up to 1-3 TBytes per spindle.
So what’s best for which?
For hard drives, conventional wisdom has been that they keep getting faster and cheaper. Turns out, only the latter is true. The cheapness of 1-3 TByte drives has made scale-out Internet data centers possible, and with it, as we said already, scale-out Big Data analytic platforms like Hadoop. Hard disk continues to be the medium of choice for large volumes of data because individual drives routinely scale to 1-3 TBytes. And momentary supply chain disruptions like the 2011 Thailand floods aside, the supply remains more than adequate. Flash drives simply don’t get as fat.
But if anything, hard drives are getting slower because it’s no longer worthwhile to try speeding them up. With flash being at least 10-100-times faster, there’s no way that disk will easily catch up even if the technology gets refreshed. Flash is actually pulling the rug out from under demand for 7200 RPM disks (currently the state of the art for disk).
Given current price trends,some analysts expect Flash to reach parity with disk in the next 12-18 months (or maybe sooner) and there to be less reason for your next transaction system to be disk-based. In fact there is good reason to be a bit skeptical on how soon supply of SSD Flash will ramp up adequately for the transaction system market; but SSD Flash will gradually make its way to prime time. Conversely, with disk likely to remain fatter in capacity than Flash, it will be best suited for active archiving that keeps older data, otherwise bound for tape, live; and for Big Data analytics, where the need is for volume.
Nonetheless, the workhorse of large Hadoop, and similar disk-based Big Data analytic or active archive clusters will likely be the slower 5400 RPM models.
So what about even faster modes of storage? In the past couple years, DRAM memory prices crossed the threshold where it became feasible to deploy them for persistent storage rather than caching of currently used data. That cleared the way for the in-memory database (IMDB), which is often code word for all-DRAM data storage.
In-memory databases are hardly new, but until the last three to four years they were highly specialized. Oracle TimesTen, one of the earliest commercial offerings, was designed for tightly-coupled, specialized transactional applications; other purpose-built in-memory data stores have existed for capital markets for at least a decade or more. But now DRAM memory prices have dropped sufficiently to bring IMDBs into the enterprise mainstream.
Yet, an interesting blip happened on the way to the "inevitable" all in-memory database future: Last spring, DRAM memory prices stopped dropping. In part this was attributable to consolidation of the industry to fewer suppliers. But the larger driver was that the wisdom of crowds — for example, that DRAM memory was now ready for prime time — got ahead of itself. Yes, the laws of supply and demand will eventually shift the trajectory of memory pricing. But nope, that won’t change the fact of life that, no matter how cheap, DRAM memory (and cache) will always be premium storage.
In-memory databases are dead, long live tiered databases
The sky is not the limit for DRAM in-memory databases. The rush to in-memory will morph into an expansion of data tiering. And actually that’s not such a bad thing: do you really need to put all of that data in memory? We think not.
IBM and Teradata have shunned all in-memory architectures; their contention is that the 80/20 rule should govern which data goes into memory. And under their breaths, the all in-memory database folks have fallbacks for paging data between disk and memory. If designed properly, this is not constant paging, but rather a process that only occurs for that rare out-of-range query. Kognitio has a clever pricing model where they don’t charge you for the disk, but just for the volume of memory. As for HANA, disk is designed into the system for permanent offline storage, but SAP quietly adds that it can also be utilized for paging data during routine operation. Maybe SAP shouldn’t be so quiet about that.
There’s one additional form of tiering to consider for highly complex analytics: it’s the boost that can come from pipelining computations inside in-chip cache. Oracle is looking to similar techniques for further optimizing upcoming generations of its Exadata database appliance platform. It’s a technique that’s part of IBM’s recent BLU architecture for DB2. High-performance analytic platforms such as SiSense also incorporate in-chip pipelining to actually reduce balance of system costs (which, for example, require less DRAM).
It’s all about balance of system
Balance of system is hardly new, but until recently, it meant trading off CPU or bandwidth with tiers of disk. Application and database design, in turn, focused on distributing or sharding data to place the most frequently accessed data on the disk or portions of disk that could be accessed the fastest. New forms of storage, including Flash and DRAM memory, add a few new elements to the mix. You’ll still configure storage (along with processor and interconnects) for the application and vice versa, but you’ll have a couple new toys in your arsenal.
For Flash, it means fast OLTP applications that could add basic analytics, such as what Oracle’s recent wave of In-Memory Applications promise. For in-memory, that would dictate OLTP applications with even more complex analytics and/or what-if simulations embedded in line, such as what SAP is promising with its recently-introduced Business Suite and CRM applications on HANA.
For in-memory, we’d contend that for most cases, configurations for keeping 100 percent of data in DRAM will remain overkill. Unless you are running a Big Data analytic problem that is supposed to encompass all of the data, you will likely work with just a fraction of the data. Furthermore, IBM, Oracle, and Teradata are incorporating data skipping features into their analytic platforms that deliberately filter irrelevant data so it is not scanned. There are many ways to speed processing before using the fast storage option.
Storage will become an application design option
Although we’re leery of hopping on the 100 percent DRAM in-memory bandwagon, when smartly deployed, that model could truly transform applications. When you eliminate the latency, you can embed complex analytics in-line with transactional applications, enable the running of more complex analytics, or make it feasible for users to run more what-if simulations to couch their decisions.
Examples include transaction applications that differentiate how to fulfill orders from gold, silver, or bronze-level customers based on levels of services and cost of fulfillment. It could help mitigate risk when making operational or fiduciary decisions by allowing the running of more permutations of scenarios. It could also enhance Big Data analytics by tiering the more frequently used data (and logic) in memory.
Whether to use DRAM or Flash will be a function of data volume and problem complexity. No longer will inclusion of storage tiers be simply a hardware platform design decision; it will become a configuration decision for application designers as well.