This guest post comes courtesy of Tony Baer’s OnStrategies blog. Tony is a principal analyst at Ovum.
By Tony Baer
Of the 3 “V’s” of Big Data – volume, variety, velocity (we’d add "Value" as the 4th V) – velocity has been the unsung ‘V.’ With the spotlight on Hadoop, the popular image of Big Data is petabyte-scale stores of unstructured data (which covers the first two V’s). While Big Data has been thought of as large stores of data at rest, it can also be about data in motion.
“Fast Data” refers to processes that require lower latencies than would otherwise be possible with optimized disk-based storage. Fast Data is not a single technology, but a spectrum of approaches that process data that might or might not be stored. It could encompass event processing, in-memory databases, or hybrid data stores that optimize cache with disk.
Fast Data is nothing new, but because of the cost of memory, was traditionally restricted to a handful of extremely high-value use cases. For instance:
Wall Street firms routinely analyze live market feeds, and in many cases, run sophisticated complex event processing (CEP) programs on event streams (often in real time) to make operational decisions.
Telcos have handled such data in optimizing network operations, while leading logistics firms have used CEP to optimize their transport networks.
In-memory databases, used as a faster alternative to disk, have similarly been around for well over a decade, having been employed for program stock trading, telecommunications equipment, airline schedulers, and large destination online retail (e.g., Amazon).
Hybrid in-memory and disk have also become commonplace, especially amongst data warehousing systems (e.g., Teradata, Kognitio), and more recently among the emergent class of advanced SQL analytic platforms (e.g., Greenplum, Teradata Aster, IBM Netezza, HP Vertica, ParAccel) that employ smart caching in conjunction with a number of other bells and whistles to juice SQL performance and scaling (e.g., flatter indexes, extensive use of various data compression schemes, columnar table structures, etc.). Many of these systems are in turn packaged as appliances that come with specially tuned, high-performance backplanes and direct attached disk.
Finally, caching is hardly unknown to the database world. Hot spots of data that are frequently accessed are often placed in cache, as are snapshots of database configurations that are often stored to support restore processes, and so on.
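To illustrate the hot-spot caching idea in the last item, here is a minimal sketch of a least-recently-used (LRU) cache sitting in front of a slower store. The class name `HotSpotCache` and the use of a plain dict as a stand-in for disk are assumptions made for the example; real databases use far more elaborate buffer management.

```python
from collections import OrderedDict


class HotSpotCache:
    """Small LRU cache in front of a slower backing store.

    Hypothetical illustration: `backing_store` is a dict standing in for disk.
    """

    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store
        self.capacity = capacity
        self.cache = OrderedDict()  # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            # Hot data: serve from memory and mark it as most recently used.
            self.cache.move_to_end(key)
            self.hits += 1
            return self.cache[key]
        # Cold data: fetch from the slower store, then keep a copy in memory.
        self.misses += 1
        value = self.backing_store[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # Evict the least recently used entry to stay within capacity.
            self.cache.popitem(last=False)
        return value
```

Frequently requested keys stay in memory, while keys that go unused are the first to be evicted when space runs out.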
So what’s changed?
The usual factors: the same data explosion that created the urgency for Big Data is also generating demand for making the data instantly actionable. Bandwidth, commodity hardware and, of course, declining memory prices, are further forcing the issue: Fast Data is no longer limited to specialized, premium use cases for enterprises with infinite budgets.
Not surprisingly, pure in-memory databases are now going mainstream. Oracle and SAP have chosen in-memory as one of the next places to stake out competitive positions, with SAP HANA vs. Oracle Exalytics. For now, both Oracle and SAP are targeting analytic processing, including OLAP (by raising the size limits on OLAP cubes) and more complex, multi-stage analytic problems that traditionally would have required batch runs (such as multivariate pricing) or would not have been run at all (too complex, too much delay). More to the point, SAP is counting on HANA as a major pillar of its stretch goal to become the #2 database player by 2015, which means expanding HANA’s target to include next generation enterprise transactional applications with embedded analytics.
Potential use cases for Fast Data could encompass:
A homeland security agency monitoring the borders that requires the ability to parse, decipher, and act on complex occurrences in real time to prevent suspicious people from entering the country
Capital markets trading firms requiring real-time analytics and sophisticated event processing to conduct algorithmic or high-frequency trades
Entities managing smart infrastructure which must digest torrents of sensory data to make real-time decisions that optimize use of transportation or public utility infrastructure
B2B consumer products firms monitoring social networks and requiring real-time response to understand sudden swings in customer sentiment
For such organizations, Fast Data is no longer a luxury, but a necessity.
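Several of these use cases come down to spotting a pattern in a stream while the events are still arriving. As a rough sketch of that idea, the snippet below flags a burst when a given number of events falls inside a sliding time window. The class name `SlidingWindowDetector`, the window length, and the threshold are all hypothetical, and the sketch assumes timestamps arrive in non-decreasing order; commercial CEP engines support much richer rules.

```python
from collections import deque


class SlidingWindowDetector:
    """Flags a burst when `threshold` events occur within `window_seconds`.

    Hypothetical illustration; assumes timestamps never go backwards.
    """

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.timestamps = deque()  # events currently inside the window

    def observe(self, timestamp):
        """Record one event and report whether the window now holds a burst."""
        self.timestamps.append(timestamp)
        # Drop events that have aged out of the window.
        while timestamp - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.threshold


detector = SlidingWindowDetector(window_seconds=10, threshold=3)
alerts = [detector.observe(t) for t in [0, 5, 20, 22, 25]]
```

Only the events still inside the window are held in memory, which is what lets this kind of check keep pace with the stream rather than waiting for a batch run.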
More specialized use cases are similarly emerging now that the core in-memory technology is becoming more affordable. YarcData, a startup from venerable HPC player Cray Computer, is targeting graph data, which represents data with many-to-many relationships. Graph computing is extremely process-intensive, and as such, has traditionally been run in batch when involving Internet-size sets of data. YarcData adopts a classic hybrid approach that pipelines computations in memory while persisting data to disk. YarcData is the tip of the iceberg – we expect to see more specialized applications that use hybrid caching to combine speed with scale.
But don't forget, memory’s not the new disk
The movement – or tiering – of data to faster or slower media is also nothing new. What is new is that data in memory may no longer be such a transient thing, and if memory is relied upon for in situ processing of data in motion or rapid processing of data at rest, memory cannot simply be treated as the new disk. Excluding specialized forms of memory such as ROM, by nature anything that’s solid state is volatile: there goes your power… and there goes your data. Not surprisingly, in-memory systems such as HANA still replicate to disk to reduce the risk of data loss. For conventional disk data stores that increasingly leverage memory, Storage Switzerland’s George Crump makes the case that caching practices must become smarter to avoid misses (where data gets mistakenly swapped out). There are also balance-of-system considerations: memory may be fast, but is its speed well matched with that of the processor? Solid state may overcome the I/O issues associated with disk, but the system may still be vulnerable to coupling issues if processors get bottlenecked or MapReduce jobs are not optimized.
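The point about replicating to disk can be made concrete with a small sketch: an in-memory key-value store that appends every write to a log file before acknowledging it, and replays that log on restart. This is only an illustration of the general pattern, not a description of how HANA or any other product works; the class name `DurableMemoryStore` and the JSON-lines log format are assumptions made for the example.

```python
import json
import os


class DurableMemoryStore:
    """In-memory key-value store backed by an append-only log on disk.

    Hypothetical illustration of persisting memory-resident data.
    """

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        # On startup, rebuild the in-memory state by replaying the log.
        if os.path.exists(log_path):
            with open(log_path, "r", encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # Write to disk first so the update survives a power loss.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

    def get(self, key):
        # Reads are served entirely from memory.
        return self.data[key]
```

Reads never touch disk, but every write pays for a disk sync, which is one reason designing for memory takes more than swapping one storage medium for another.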
Declining memory prices are putting Fast Data on the fast lane to the mainstream. But even as the technology becomes affordable, we’re still early in the learning curve for how to design for it.