In a brilliant PhD thesis, Understanding and Improving the Latency of DRAM-Based Memory Systems, Kevin K. Chang of CMU tackles the DRAM issue, and suggests some novel architectural enhancements to make substantial improvements in DRAM latency.
Kevin breaks the DRAM latency problem into four issues, three of which I'll summarize here:
- Inefficient bulk data movement.
- DRAM refresh interference. While DRAM is being refreshed, it can't all be accessed.
- Cell latency variation, due to manufacturing variability.
The fourth issue, the impact of power on latency, is left for the interested reader to investigate.
Inefficient bulk data movement
Back when memory and storage were costly, data movement was confined to register-sized chunks or, at most, a 512-byte block from disk. But today, with terabytes of storage and gigabytes of memory, and with video and streaming data everywhere, bulk data movement is ever-more common.
But the architecture of data movement - from memory to CPU over narrow memory buses - hasn't changed. Mr. Chang's suggestion? A new, high-bandwidth data path between subarrays of memory, using a few isolation transistors to create a wide - 8,192 bits wide - parallel bus between subarrays in the same bank of memory.
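To see why an in-DRAM link matters, here's a toy back-of-the-envelope sketch (my numbers, not the thesis's): copying one 8,192-bit row over a typical 64-bit memory channel takes over a hundred bus transfers round-tripping through the CPU, while a row-wide link between subarrays moves it in one step.

```python
# Toy model of bulk data movement. The 8,192-bit row width comes from the
# article; the 64-bit channel width is a typical DDR data-bus assumption.

ROW_BITS = 8192      # one DRAM row, as cited above
CHANNEL_BITS = 64    # assumed width of the CPU-to-memory data bus

def transfers_over_channel(row_bits: int, channel_bits: int) -> int:
    """Bus transfers needed to copy one row out over the memory channel
    and back in again (ceiling division)."""
    return -(-row_bits // channel_bits)

def transfers_over_wide_link(row_bits: int, link_bits: int = 8192) -> int:
    """Transfers over a row-wide link connecting subarrays in a bank."""
    return -(-row_bits // link_bits)

print(transfers_over_channel(ROW_BITS, CHANNEL_BITS))  # 128 narrow transfers
print(transfers_over_wide_link(ROW_BITS))              # 1 wide transfer
```

A 128-to-1 reduction in transfers is the intuition; the real win also includes never occupying the memory channel or the CPU at all.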
DRAM refresh interference
DRAM memory cells need to be refreshed to retain data, which is why it's called Dynamic RAM. DRAM is refreshed in ranks, not all at once, because doing so would require too much power. While a rank is being refreshed, however, it can't be accessed, which creates latency.
DRAM refresh overhead is getting worse, because as chip density increases, more rows need to be refreshed, degrading performance by almost 20 percent on 32Gb chips.
Mr. Chang proposes two mechanisms that hide refresh latency by parallelizing refreshes with memory accesses across banks and subarrays. The first is out-of-order per-bank refresh, which lets the memory controller refresh an idle bank instead of following the usual strict round-robin order. The second is write-refresh parallelization, which overlaps refresh latency with write latency.
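The out-of-order idea is simple enough to sketch in a few lines. This is a toy scheduler of my own, not the thesis's hardware design: given which banks currently have pending demand requests, prefer refreshing an idle bank over the next bank in round-robin order.

```python
from typing import List

def pick_bank_round_robin(next_bank: int, num_banks: int) -> int:
    """Strict round-robin: refresh banks in a fixed order, busy or not."""
    return next_bank % num_banks

def pick_idle_bank(busy: List[bool], next_bank: int) -> int:
    """Out-of-order per-bank refresh (toy sketch): scan from the
    round-robin candidate and take the first idle bank, so the refresh
    hides behind demand accesses to the busy banks."""
    n = len(busy)
    for offset in range(n):
        candidate = (next_bank + offset) % n
        if not busy[candidate]:
            return candidate
    return next_bank % n  # every bank busy: the default is as good as any

# Bank 0 has pending reads; banks 1-7 are idle.
busy = [True] + [False] * 7
print(pick_bank_round_robin(0, 8))  # 0 -> refresh collides with the reads
print(pick_idle_bank(busy, 0))      # 1 -> refresh overlaps bank 0's work
```

The real mechanism also has to respect DRAM timing and fairness constraints, but the core trade is exactly this: spend a little scheduling logic to keep refreshes off the critical path.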
In his testbed, with an 8-core CPU, these strategies improved weighted memory performance by more than 27 percent.
Cell latency variation
Thanks to manufacturing variation, memory cells can have substantial performance differences, and those differences grow as density rises. But DRAM is specified to be reliable at the speed of its slowest cells, which means there's significant performance upside in exploiting the faster ones.
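A toy illustration of the upside (my made-up latencies, not measured data): if the chip's timing must cover the slowest region, every access pays the worst case, while per-region timing pays only each region's own latency.

```python
import random

random.seed(1)
# Hypothetical access latencies (ns) for 16 regions of one chip.
# The 10-13.5 ns spread is invented purely for illustration.
region_latency_ns = [random.uniform(10.0, 13.5) for _ in range(16)]

worst_case = max(region_latency_ns)                      # what the spec assumes
per_region = sum(region_latency_ns) / len(region_latency_ns)  # avg if timed per region

speedup = (worst_case - per_region) / worst_case
print(f"average latency saved by per-region timing: {speedup:.1%}")
```

The wider the manufacturing spread, the bigger the win, which is why this opportunity grows with density.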
Mr. Chang proposes two mechanisms to take advantage of this variation, but space constraints keep me from describing them in detail. Suffice it to say that they achieved speedups of 13 to almost 20 percent.
The Storage Bits take
Finding bottlenecks - and fixing them - is a never-ending job in system architecture. DRAM has long avoided being the bottleneck, but the latency plateau we're seeing says that's about to change.
As it gets harder to wring performance out of more transistors, specialized instruction sets, and the like, lower DRAM latency becomes a prime target for performance improvement. Let's hope Intel and AMD take notice.
Courteous comments welcome, of course.