HDFS and file system wanderlust

The Hadoop Distributed File System (HDFS) is a pillar of Hadoop. But its single-point-of-failure topology, and its restriction that a file can be written only once, leave the Enterprise wanting more. Some vendors are trying to answer the call.

A continuing theme in Big Data is the commonality, and developmental isolation, between the Hadoop world on the one hand and the enterprise data, Business Intelligence (BI) and analytics space on the other.  Posts to this blog -- covering how Massively Parallel Processing (MPP) data warehouse and Complex Event Processing (CEP) products tie in to MapReduce -- serve as examples.

The Enterprise Data and Big Data worlds will continue to collide and, as they do, they'll need to reconcile their differences.  The Enterprise is set in its ways.  And when Enterprise computing discovers something in the Hadoop world that falls short of its baseline assumptions, it may try to work around it.  From what I can tell, a continuing hot spot for this kind of adaptation is Hadoop's storage layer.

Hadoop, at its core, consists of the MapReduce parallel computation framework and the Hadoop Distributed File System (HDFS).  HDFS' ability to federate disks on the cluster's worker nodes, and thus allow MapReduce processing to work against storage local to the node, is a hallmark of what makes Hadoop so cool.  But HDFS files are immutable -- which is to say they can be written only once.  Also, HDFS' reliance on a single "name node" to orchestrate storage means the file system has a single point of failure.
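
To make the write-once idea concrete, here is a minimal toy sketch in Python.  It is not the HDFS API; the WriteOnceStore class and its create and read methods are names I made up purely for illustration.  The single dictionary stands in, very loosely, for the one name node that tracks every file.

```python
class WriteOnceStore:
    """Toy model of write-once, read-many storage. Hypothetical; not the HDFS API."""

    def __init__(self):
        # One in-memory map of path -> contents. It loosely stands in for the
        # single name node: lose this object and every path is lost with it.
        self._files = {}

    def create(self, path, data):
        # A path can be written exactly once; any later write is rejected.
        if path in self._files:
            raise FileExistsError(f"{path} already exists and cannot be rewritten")
        self._files[path] = bytes(data)

    def read(self, path):
        # Reads can be repeated as often as needed.
        return self._files[path]


store = WriteOnceStore()
store.create("/logs/day1", b"first batch")
print(store.read("/logs/day1"))

try:
    store.create("/logs/day1", b"changed my mind")
except FileExistsError as err:
    print("rejected:", err)
```

The second create call fails, which is roughly the experience that makes the database administrator in the next paragraph so unhappy.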

Pan to the enterprise IT pro who's just discovering Hadoop, and cognitive dissonance may ensue.  You might hear a corporate database administrator exclaim: "What do you mean I have to settle for write-once read-many?"  This might be followed with outcry from the data center worker: "A single point of failure?  Really?  I'd be fired if I designed such a system."  The worlds of Web entrepreneur technologists and enterprise data center managers are so similar, but their relative seclusion makes for a little bit of a culture clash.

Maybe the next outburst will be "We're mad as hell and we're not going to take it anymore!"  The market seems to be bearing that out.  MapR's distribution of Hadoop features DirectAccess NFS, which provides the ability to use read/write Network File System (NFS) storage in place of HDFS.  Xyratex's cluster file system, called Lustre, can also be used as an API-compatible replacement for HDFS (the company wrote a whole white paper on just that subject, in fact).  Appistry's CloudIQ storage does likewise.  And although it doesn't swap out HDFS, the Apache HCatalog system will provide a unified table structure for raw HDFS files, Pig, Streaming, and Hive.

Sometimes open source projects do things their own way.  Sometimes that gets Enterprise folks scratching their heads.  But given the Hadoop groundswell in the Enterprise, it looks like we'll see a consensus architecture evolve.  Even if there's some creative destruction along the way.