There is money to be saved
Main memory treats all data the same. Servers typically use some form of error-correcting code (ECC) memory to detect and correct errors, and with today's large-memory servers the added cost can be significant.
Researchers at Microsoft and Carnegie Mellon University are studying the issue. They found that about 57 percent of data center total cost of ownership (TCO) is capital cost - most of which is server cost - and that processors and memory account for about 60 percent of server cost. Reducing memory costs could therefore materially improve data center capital efficiency.
ECC also slows down systems and, due to added logic and RAM, increases power and cooling costs. It's a double whammy.
The researchers wanted to know whether all applications need the level of care that ECC provides and, if they don't, how much could be saved through heterogeneous memory systems. The key is to understand how vulnerable a given workload is to memory errors.
What was that masked error?
Not all memory errors create problems, or are even detected. If the error is overwritten before reading, no one will ever know. Here's their memory error taxonomy:
In their testing the team injected errors into the memory system to determine how harmful they were. If the app crashed or the results were wrong, they considered it a serious error.
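To make the injection idea concrete, here is a minimal Python sketch. It is not the researchers' tooling: the function names, the toy workload, and the outcome labels are all assumptions made for illustration. It flips one bit in a buffer standing in for application memory, reruns the workload, and labels the outcome as masked (the corrupted byte was overwritten before being read), incorrect, or crash.

```python
def flip_bit(memory: bytearray, byte_index: int, bit_index: int) -> None:
    """Inject a single-bit error by XOR-ing one bit in place."""
    memory[byte_index] ^= 1 << bit_index


def toy_workload(memory: bytearray) -> int:
    """Hypothetical workload: overwrite the first half, then sum everything."""
    half = len(memory) // 2
    for i in range(half):
        memory[i] = 1  # an error injected here is overwritten before it is read
    return sum(memory)


def classify(byte_index: int, bit_index: int, size: int = 16) -> str:
    """Compare a run with an injected error against an error-free run."""
    golden = toy_workload(bytearray(size))
    faulty_memory = bytearray(size)
    flip_bit(faulty_memory, byte_index, bit_index)
    try:
        result = toy_workload(faulty_memory)
    except Exception:
        return "crash"
    return "masked" if result == golden else "incorrect"


print(classify(byte_index=2, bit_index=0))   # error lands in the overwritten half
print(classify(byte_index=12, bit_index=3))  # error lands in data that is read
```

The toy workload never crashes, so the crash branch is only there to show where a real harness would record that outcome.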
They considered three workloads: WebSearch, Memcache, and GraphLab. All are common apps in Internet-scale data centers.
They reached six significant conclusions.
- Error tolerance varies across applications.
- Error tolerance varies within an application.
- Quick-to-crash behavior differs from periodically incorrect behavior.
- Some memory regions are safer than others.
- More severe errors mainly decrease correctness.
- Data recoverability varies across memory regions.
After looking at how applications behave under varying memory conditions, the team concluded that:
. . . use of memory without error detection/correction (and the subsequent propagation of errors to persistent storage) is suitable for applications with the following two characteristics: (1) application data is mostly read-only in memory (i.e., errors have a low likelihood of propagating to persistent storage), and (2) the result from the application is transient in nature (i.e., consumed immediately by a user and then discarded).
A number of popular applications, such as search, streaming, gaming and social networking, fit that profile.
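The two characteristics in the quote can be read as a simple screening test. The sketch below is one hedged way to write that down in Python. The `Workload` fields, the 0.5 threshold for "mostly read-only", and the example numbers are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    read_only_fraction: float  # assumed share of in-memory data that is only read
    transient_results: bool    # True if results are consumed once and discarded


def suits_non_ecc_memory(w: Workload, read_only_threshold: float = 0.5) -> bool:
    """Apply both criteria: mostly read-only data AND transient results."""
    return w.read_only_fraction > read_only_threshold and w.transient_results


# Hypothetical example values, for illustration only.
search = Workload("search index", read_only_fraction=0.9, transient_results=True)
ledger = Workload("billing ledger", read_only_fraction=0.3, transient_results=False)

print(suits_non_ecc_memory(search))
print(suits_non_ecc_memory(ledger))
```

Both conditions must hold. A workload that is mostly read-only but persists its results would still fail the check.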
Scale makes the difference
After examining the variables, the authors conclude that heterogeneous memory can save up to 4.7 percent on server costs while still delivering 99.9 percent server availability.
That may not seem like much, but if you're spending a billion a year on servers it starts to add up. $47M will buy a nice mix of pizza, PhDs, and data center muscle.
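The arithmetic is easy to check. This short Python sketch applies the paper's best-case 4.7 percent figure to a hypothetical annual server budget. The function name and the budget are assumptions, and because 4.7 percent is an upper bound, the result is a ceiling rather than a forecast.

```python
def annual_savings(server_spend: float, savings_rate: float = 0.047) -> float:
    """Upper-bound savings for a given yearly server spend, in the same currency."""
    return server_spend * savings_rate


# A hypothetical $1 billion yearly server budget.
print(f"${annual_savings(1_000_000_000) / 1e6:.0f}M")  # prints "$47M"
```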
The Storage Bits take
Few enterprises have enough scale to make it worth characterizing and altering apps to save 4.7 percent on server costs. But Internet data centers do and will.
These stepwise enhancements - shaving off a couple of percent here and a couple more there - will keep driving IaaS costs down while enterprise costs keep rising. Enterprise IT managers will have to become brokers more than suppliers to their enterprise customers.
The paper is "Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory" by Yixin Luo, Justin Meza and Onur Mutlu of CMU and Sriram Govindan, Bikash Sharma, Mark Santaniello, Aman Kansal, Jie Liu, Badriddine Khessib and Kushagra Vaid of Microsoft.
Comments welcome, as always.