Warp speed file serving with pNFS

Files: quickly getting bigger. Networks: slowly getting faster. Something's got to give. Here's the scoop.

Parallel NFS: standards-based parallel file serving
The Network File System (NFS) is the oldest NAS (Network Attached Storage) protocol. Developed by Sun in the '80s and made an open standard, NFS makes files on the network available anywhere.

Small files: great. Big files: lo-o-o-ng time coming
NAS is popular because it uses cheap, reliable and reasonably fast Ethernet instead of cranky, expensive and very fast Fibre Channel. NFS is very popular as the storage protocol for compute clusters. Yet as data sets and file sizes have grown, the relative speed of Ethernet just hasn't kept up.

I worked with some oil companies doing reservoir modeling about six years ago. Even then it was taking them 6-10 hours just to move data from one stage of their workflow to the next. It was killing them.

You'd think 10 gigabit Ethernet would solve the problem. But NFS had a tough time scaling even to gigabit Ethernet. That's why you see TCP Offload Engines (TOEs), custom hardware pipelines and other costly go-fast goodies on gigE storage.

Enter the dragon
The Internet Engineering Task Force (IETF) is the NFS standards body. It started work on a parallel version of NFS, to enable much higher speeds, about four years ago. The new standard, NFS v4.1, should reach final draft status later this year. Some early birds may be out with products late this year as well.

How NFS works
Standard NFS file servers work like your PC does: the files are on local disks, and the computer keeps track of their location, name, creation and modification dates, size and so on. This bookkeeping is called metadata: data about your data.

When you request a file, the file server receives the request, looks up the metadata, converts it to disk I/O requests, collects the data and then ships it over the network to you. With small files, most of the time is spent collecting the data.

With big files the data transmission time becomes the limiting factor. What if you could break a big file into pieces and ship it in parallel to a compute server? That would be faster, especially with several parallel connections.

That's exactly what parallel NFS (pNFS) does.
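A toy calculation makes the bottleneck concrete. This sketch uses purely illustrative numbers (a 500 GB file, 1 Gbps links, and perfectly even striping, all my assumptions, not benchmark data) to show how spreading a big file across parallel links divides the transfer time:

```python
# Toy model of serial vs. parallel file transfer. This is just the
# arithmetic behind the speedup, not the pNFS wire protocol, and the
# numbers are illustrative assumptions.

def transfer_time(file_gb, links, link_gbps=1.0):
    """Seconds to move file_gb gigabytes over `links` parallel links."""
    per_link_gb = file_gb / links       # file striped evenly across links
    return per_link_gb * 8 / link_gbps  # gigabytes -> gigabits / line rate

big_file = 500  # GB -- say, one stage of a reservoir model

serial = transfer_time(big_file, links=1)
parallel = transfer_time(big_file, links=10)

print(f"1 link  : {serial / 3600:.1f} hours")
print(f"10 links: {parallel / 3600:.1f} hours")
```

With one gigabit link the 500 GB takes over an hour; with ten links working in parallel it takes a tenth of that, which is the whole pitch.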

How pNFS works
pNFS splits the NFS file server into two types of servers: a metadata and control server, and as many storage servers as you can afford. Together the control server and the storage servers form a single logical NFS server with a slew of network connections. The compute server, which is likely to be a Beowulf cluster, also has plenty of Ethernet ports.

So the compute server requests a file using the new v4.1 NFS client. The NFS control server receives the request and looks up where the file chunks reside on the various storage servers. It sends this information, called a layout, back to the NFS v4.1 client, which then tells its cluster members where to get the data. The cluster members then use the layout to request the data directly from the storage servers.

If you've got 10 storage servers for a 10-node cluster, you will see something close to a 10x increase in speed. With 100 of each you'll see close to a 100x increase. It is almost magic.
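The request/layout/fetch sequence can be sketched in a few lines. Everything here is hypothetical (the server names, the file name, the layout structure); the real NFS v4.1 layout types live in the spec, not in this toy:

```python
# Sketch of the pNFS idea: a control server holds only metadata and
# hands clients a "layout" mapping byte ranges of a file to storage
# servers. Clients then read the chunks in parallel, straight from
# the storage servers. All names here are made up for illustration.

from concurrent.futures import ThreadPoolExecutor

# Control server's metadata: file -> list of (storage server, offset, length).
LAYOUTS = {
    "seismic.dat": [
        ("storage-1", 0, 4),   # bytes 0..3 live on storage-1
        ("storage-2", 4, 4),   # bytes 4..7 live on storage-2
        ("storage-3", 8, 4),   # bytes 8..11 live on storage-3
    ],
}

# Each storage server holds only its own chunks (simulated in memory).
STORAGE = {
    "storage-1": {("seismic.dat", 0): b"ABCD"},
    "storage-2": {("seismic.dat", 4): b"EFGH"},
    "storage-3": {("seismic.dat", 8): b"IJKL"},
}

def get_layout(path):
    """Step 1: client asks the control server where the chunks live."""
    return LAYOUTS[path]

def read_chunk(server, path, offset, length):
    """Step 2: client reads one chunk directly from a storage server."""
    return STORAGE[server][(path, offset)][:length]

def pnfs_read(path):
    """Fetch all chunks in parallel, then reassemble in offset order."""
    layout = get_layout(path)
    with ThreadPoolExecutor() as pool:
        chunks = pool.map(lambda e: read_chunk(e[0], path, e[1], e[2]), layout)
    return b"".join(chunks)

print(pnfs_read("seismic.dat"))
```

The design point to notice: the control server never touches the data path. Once the layout is handed out, every byte flows directly between storage servers and clients, which is where the near-linear scaling comes from.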

AND it's backward compatible
You'll still be able to access the data even with a lowly PC. Your NFS client makes the request, the control server gathers the data itself and sends it on to you. Other than being slower than pNFS, you'll never know the difference.

No changes to applications either. The IETF team did a good job on this one.

The Storage Bits take
pNFS is going to be very popular in the large-scale high-performance computing cluster space. These clusters are so big that adding just a few hundred bucks per node for some tweak quickly adds up.

I fantasize about a home pNFS array for video editing: stick four gigE ports on my local machine and editing large files wouldn't be nearly as painful. But that is a ways off. For the big clusters though, a new day is starting to dawn.

Comments welcome, of course. Like reading specs? The IETF NFS v4.1 specs page will make your day.

Topics: Servers, Networking, Storage



Comments
  • BitTorrent FS, anyone?

    So, basically, this is the BitTorrent concept added to NFS, right?
    • Hadn't thought of it that way, but you're right.

      I just looked up BitTorrent on wikipedia so now I'm an expert.
      To the extent that both pNFS and BT break files into pieces and store them on
      different servers, yes, they are similar. Since a pNFS server is a complete system, it
      will have higher and more consistent performance than BT.

      pNFS defines interfaces and the basic back-end architecture. I think the
      competition will be on the back-end implementation.
      BT is a fully distributed system, pNFS less so.
      R Harris
  • Similar to striping?

    My thoughts too. The idea of separating metadata from data is common to BitTorrent and pNFS, as another comment pointed out. Once the layout is returned by the metadata server, the client/cluster nodes access the storage nodes/data servers directly, using the CPU, spindles and network pipes of all the storage nodes to get the data faster. This sounds like striping at a different level of abstraction.

    How is a write handled by pNFS? Who creates the layout? I guess I gotta search. Good post Robin!
  • Not the first, but the one with the most potential

    Parallel and shared-disk filesystems, both of which use the concept of separating metadata from data, have been around for a long time. IBM GPFS, SGI CXFS, and LSC (now Sun) QFS were part of this trend in the late 1990s. However, none have been highly successful, leading to the second wave of object-based parallel NAS file systems like Lustre and IBRIX. But these have had issues as well, which leads us back to NFS.

    pNFS uses standard NFS-over-IP metadata calls to handle the metadata, and can use NFS or any block transport (iSCSI, FC, etc.) to handle block data. RDMA capabilities are available for both metadata (using NFS over RDMA) and block traffic (using iSCSI Extensions for RDMA over iWARP Ethernet and SCSI RDMA Protocol over InfiniBand, as well as Fibre Channel and emerging technologies like FCoE).

    While initially most useful for HPC, expect pNFS over time to become the preferred way of using NFS. Also expect pNFS to become the de facto Oracle RAC filesystem for UNIX and Linux.

    With the buzz around 10Gb Ethernet, pNFS is hitting at the right time.