How good does memory need to be?

Summary: Main memory is all the same. But why? All data is not created equal, so why is memory? Another reason the cloud is winning.

There is money to be saved

Main memory treats all data the same. Servers typically use some form of error-correcting code (ECC) to detect and correct memory errors, and with today's large-memory servers the added cost can be significant.

Researchers at Microsoft and Carnegie Mellon University are studying the issue. They find that ≈57 percent of data center TCO is capital cost - most of it server cost - and that processors and memory make up about 60 percent of server cost. Reducing memory cost could therefore materially improve data center capital efficiency.

ECC also slows down systems and, due to added logic and RAM, increases power and cooling costs. It's a double whammy.

The researchers wanted to know whether all applications need the level of care ECC provides and, if not, how much could be saved with heterogeneous memory systems. The key is understanding how vulnerable a given workload is to memory errors.

What was that masked error?

Not all memory errors cause problems, or are even detected. If an error is overwritten before it is read, it's masked: no one will ever know. The researchers built a taxonomy classifying errors by whether and how the application observes their effects.

In their testing the team injected errors into the memory system to determine how harmful they were. If the app crashed or the results were wrong, they considered it a serious error.
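The paper's injection framework isn't public, but the method is easy to sketch: flip a single bit in an application's in-memory data, rerun the workload, and bucket the outcome. Here's a minimal sketch in Python; the toy checksum workload and all names are illustrative stand-ins, not the paper's code:

```python
import random

def flip_bit(data: bytearray) -> None:
    """Simulate a single-bit soft error: flip one random bit in place."""
    bit = random.randrange(len(data) * 8)
    data[bit // 8] ^= 1 << (bit % 8)

def classify(workload, golden, data: bytearray) -> str:
    """Rerun the workload on corrupted data and bucket the outcome,
    in the spirit of the paper's crash / incorrect / masked categories."""
    try:
        result = workload(data)
    except Exception:
        return "crash"
    return "masked" if result == golden else "incorrect"

# Toy stand-in for a real workload (WebSearch, Memcached, GraphLab):
# a simple checksum over a read-only buffer.
def workload(data) -> int:
    return sum(data) % 251

buf = bytearray(b"the quick brown fox " * 64)
golden = workload(buf)     # error-free reference result
flip_bit(buf)
print(classify(workload, golden, buf))   # prints "incorrect": any single
                                         # flip changes this checksum
```

A real harness would repeat this thousands of times across different memory regions and error patterns, which is how the team found that tolerance varies both across and within applications.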

They considered three workloads: WebSearch, Memcached, and GraphLab. All are common applications in Internet-scale data centers.

They reached six significant conclusions.

  • Error tolerance varies across applications.
  • Error tolerance varies within an application.
  • Quick-to-crash behavior differs from periodically incorrect behavior.
  • Some memory regions are safer than others.
  • More severe errors mainly decrease correctness.
  • Data recoverability varies across memory regions.

After looking at how applications behave under varying memory conditions, the team concluded that:

. . . use of memory without error detection/correction (and the subsequent propagation of errors to persistent storage) is suitable for applications with the following two characteristics: (1) application data is mostly read-only in memory (i.e., errors have a low likelihood of propagating to persistent storage), and (2) the result from the application is transient in nature (i.e., consumed immediately by a user and then discarded).

A number of popular applications, such as search, streaming, gaming and social networking, fit that profile.

Scale makes the difference

After examining the variables, the authors conclude that heterogeneous memory can save up to 4.7 percent of server cost while still delivering 99.9 percent server availability.

That may not seem like much, but if you're spending a billion a year on servers it starts to add up. $47M will buy a nice mix of pizza, PhDs, and data center muscle.
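The back-of-the-envelope arithmetic, using the post's hypothetical $1B annual server spend:

```python
# 4.7% of a $1B/yr server budget (figures from the post, spend hypothetical).
server_capex = 1_000_000_000   # annual server spend, USD
savings_rate = 0.047           # up to 4.7% server-cost saving (paper's estimate)
savings = server_capex * savings_rate
print(f"${savings / 1e6:.0f}M per year")   # prints "$47M per year"
```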

The Storage Bits take
Few enterprises have enough scale to make it worth characterizing and altering apps to save 4.7 percent on server costs. But Internet data centers do and will.

These stepwise enhancements - shaving off a couple of percent here and a couple more there - will keep driving IaaS costs down while enterprise costs keep rising. Enterprise IT managers will have to become brokers more than suppliers to their internal customers.

The paper is Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory by Yixin Luo, Justin Meza and Onur Mutlu of CMU and Sriram Govindan, Bikash Sharma, Mark Santaniello, Aman Kansal, Jie Liu, Badriddine Khessib and Kushagra Vaid of Microsoft.

Comments welcome, as always. 

Topics: Storage, Cloud, Hardware, Microsoft


Comments
  • ECC does take some power... but not speed penalties.

    Unless you have a really poor implementation. Computation of parity should be done in parallel with CPU data transfer. The same applies for peripheral I/O.

    It is so common and easy that it should always be used in systems with multiple users, or even multi-tasking.

    Now if your system has less than 1MB of memory... yeah - it could be optional (but use parity instead). But there are almost no systems (other than the smallest embedded processors) with that little memory.

    Any system that has memory separate from the CPU (which usually eliminates embedded anyway) should use ECC.

    What makes large-server (i.e. mainframe-level) memory expensive isn't ECC - it's handling memory page replacement when errors DO occur. This works just like a disk's automatic remapping of sectors when an error occurs. Memory flaws can thus be isolated, and operation continues without interruption. But it does mean a reasonable number of MBs must be set aside as replacement memory.
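The page-replacement scheme this commenter describes can be modeled in a short sketch. The class, frame size, and method names here are hypothetical, not from any real memory controller:

```python
class RetiringMemory:
    """Toy model of memory page replacement: on an error report, the
    frame's ECC-corrected contents move to a spare frame and the bad
    frame is retired, like a disk remapping a failing sector."""
    FRAME_SIZE = 4096  # bytes, hypothetical

    def __init__(self, frames: int, spares: int):
        self.store = {i: bytearray(self.FRAME_SIZE)
                      for i in range(frames + spares)}
        self.spares = list(range(frames, frames + spares))
        self.map = {i: i for i in range(frames)}  # logical -> physical frame
        self.retired = set()

    def report_error(self, logical: int) -> int:
        """Retire the physical frame backing `logical`; return its replacement."""
        if not self.spares:
            raise RuntimeError("out of spare frames")
        bad = self.map[logical]
        new = self.spares.pop()
        self.store[new][:] = self.store[bad]  # copy corrected contents over
        self.retired.add(bad)
        self.map[logical] = new
        return new
```

For example, `mem = RetiringMemory(frames=8, spares=2)` followed by `mem.report_error(3)` remaps logical frame 3 to a spare while its contents survive - operation continues until the spare pool is exhausted.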
  • I thought as you do, but it seems we were ill-informed.

    From the paper:

    "In terms of performance, existing error detection and correction techniques incur a slowdown on each memory access due to their additional circuitry and up to an additional 10% slowdown due to techniques that operate DRAM at a slower speed to reduce the chances of random bit flips due to electrical interference in higher-density devices. . . . In addition, whenever an error is detected or corrected on modern hardware, the processor raises an interrupt that must be serviced by the system firmware (BIOS), incurring up to 100 μs latency—roughly 2000x a typical 50 ns memory access latency—leading to unpredictable slowdowns."

    They also note that memory errors are rising as well, despite these mechanisms.

    R Harris
  • A world of tradeoffs

    In the old world, before mega-cloud providers, the cost of servicing an error by way of DIMM replacement far exceeded the incremental cost of adding ECC. Plus the inventory cost of spares favored a single type of DIMM per server. Mixed-reliability memory looks logistically challenging if the servers are intended to be repaired; if the entire server is disposable, it's not an issue.

    Other thoughts: not all memory suppliers produce 'L' (less-tested) devices.

    If memory bit errors are increasing (row hammer anyone?) the efforts to better tolerate errors are well founded. It might not be used to reduce memory cost, but rather extend the reliability architecture and methodologies available today.

    Final comment: memory prices are at historical highs. Once supply catches up with demand, prices will fall, the %BOM cost contribution from DRAM will halve, and once again this issue will be put on the back burner, as it has been many times over the last few decades.