Building a large-scale shared flash infrastructure

Rack-scale integration (RSI) could bring cloud economics to enterprise data centers. But storage is a sticking point. A new software stack makes remote flash almost equivalent to local flash. Here's what you need to know.
Written by Robin Harris, Contributor

RSI, conceptually, breaks server components - CPU, GPU, storage, memory, network - out into separate racks, so each can be sized and upgraded in full-rack units, connected by high-capacity, low-latency PCIe links. Layer virtual server software on top of that, so virtual servers can be constructed from pieces of the rack-scale components, and you have a game-changing config that makes enterprise infrastructure competitive with cloud services.

But - there's always a but - using non-volatile memory and storage over a network typically means unacceptable latency. That's where a Stanford team's ReFlex - a software storage server - comes in.


Making remote flash access acceptable means balancing several goals. The biggest is low latency, but high throughput matters too: saturating an NVMe device with as few CPU cores as possible.

Managing multi-tenancy in a shared flash pool requires isolation, so applications aren't stepping on each other's toes. It's also desirable to be flexible in how the flash is shared, and to accommodate other deployment issues, such as scale and network protocols.

In testing, the Stanford team found that ReFlex achieved remote flash performance equivalent to local flash accesses over 10Gb Ethernet using TCP/IP. They explain:

ReFlex achieves high performance with limited compute requirements using a novel dataplane kernel that tightly integrates networking and storage. The dataplane design avoids the overhead of interrupts and data copying, optimizes for locality, and strikes a balance between high throughput (IOPS) and low tail latency.


The big problem with multi-tenancy on flash devices is the huge difference between read and write performance. Writes can take many milliseconds, while reads are sub-millisecond affairs. This means an app that does a lot of writing, say metadata updates or streaming video, uses a lot of an NVMe device's resources.
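To see why a write-heavy tenant eats so much of a device, here is a minimal Python sketch of a weighted cost model. It is an illustration, not the ReFlex code: the function name `max_iops`, the 10x write cost, and the budget of one million cost units per second are all assumed values chosen to make the arithmetic easy to follow.

```python
def max_iops(token_rate, read_fraction, write_cost=10.0, read_cost=1.0):
    """IOPS ceiling for a device that can spend `token_rate` cost units per second.

    All parameters are hypothetical: a read is assumed to cost 1 unit and a
    write `write_cost` units, so the ceiling falls as the write share grows.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    # Average cost of one request for this read/write mix.
    avg_cost = read_fraction * read_cost + (1.0 - read_fraction) * write_cost
    return token_rate / avg_cost


if __name__ == "__main__":
    budget = 1_000_000  # assumed device budget, cost units per second
    for read_pct in (100, 90, 50):
        ceiling = max_iops(budget, read_pct / 100)
        print(f"{read_pct}% reads -> about {ceiling:,.0f} IOPS")
```

Under these assumed numbers, moving from all reads to a 50/50 mix cuts the ceiling to less than a fifth, which is why one write-heavy app can crowd out its neighbors.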

The Stanford team implemented a QoS scheduler with global visibility into the entire workload across all tenants. The maximum IOPS depends on the read/write ratio of all requests. The scheduler looks at each workload's service level objective (SLO), prioritizes latency critical apps over best effort apps, and ensures that SLOs for apps are met.
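The prioritization described above can be sketched as a two-pass scheduler. This is a simplified illustration, not the Stanford team's implementation: the token mechanism, the `Tenant` class, the `schedule_round` function, and the read and write costs are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field

READ_COST = 1    # assumed cost of a read, in tokens
WRITE_COST = 10  # assumed cost of a write, in tokens


@dataclass
class Tenant:
    name: str
    latency_critical: bool
    reserved_tokens: int = 0  # per-round guarantee derived from the SLO (hypothetical)
    queue: deque = field(default_factory=deque)  # pending ops: "r" or "w"


def cost(op):
    return WRITE_COST if op == "w" else READ_COST


def schedule_round(tenants, device_tokens):
    """Admit requests for one scheduling round; returns [(tenant_name, op), ...]."""
    admitted = []
    remaining = device_tokens
    # Pass 1: latency-critical tenants spend up to their reserved tokens first.
    for t in tenants:
        if not t.latency_critical:
            continue
        budget = min(t.reserved_tokens, remaining)
        while t.queue and cost(t.queue[0]) <= budget:
            op = t.queue.popleft()
            budget -= cost(op)
            remaining -= cost(op)
            admitted.append((t.name, op))
    # Pass 2: best-effort tenants share whatever is left, round-robin.
    best_effort = [t for t in tenants if not t.latency_critical]
    progress = True
    while progress:
        progress = False
        for t in best_effort:
            if t.queue and cost(t.queue[0]) <= remaining:
                op = t.queue.popleft()
                remaining -= cost(op)
                admitted.append((t.name, op))
                progress = True
    return admitted
```

Because the scheduler sees every tenant's queue and charges writes more than reads, a best-effort tenant that floods the device with writes only gets the tokens left over after the latency-critical tenants have been served.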

As a result of this and other optimizations, ReFlex is capable of serving up to 850K IOPS per core while only adding 21µs of latency over direct access to local flash. That's remarkably good.

The Storage Bits take

Intel's visionaries have been promoting the RSI concept for years, but it looks like 2018 will be the year that all the necessary pieces - PCIe v4 in particular - will come together to make it technically and economically feasible. With the crash in flash prices, racks full of flash are more affordable than they've ever been, even at multi-hundred TB scale.

This is very good news for our data-intensive future. It will be interesting to see if any of the enterprise storage vendors productize something like ReFlex.

Courteous comments welcome, of course. The paper ReFlex: Remote Flash ≈ Local Flash won a Most Memorable paper award at NVMW 19.
