Hadoop 3 confronts the realities of storage growth

While Hadoop was designed around commodity infrastructure, Hadoop 3.x confronts the reality that too much cheap storage can get expensive.
Written by Tony Baer (dbInsight), Contributor

When it came to storage, the promises of Hadoop sound redolent of those associated with the early days of nuclear power: commodity storage would be too cheap to meter. The original design premise was that if you could make massively parallel computing linearly scalable and bring it close to the data, using commodity hardware would make storage costs an afterthought. That was the assumption behind Hadoop's original approach to high-availability: with storage so cheap, make data available (and local) with three full replicas of the data.

But of course the world is hardly running out of data; most predictions have it growing geometrically or exponentially. At some point, big data storage must obey the laws of gravity, if not physics; having too much of it will prove costly either in overhead or dollars, and more likely, both.

The Hadoop 3.0 spec acknowledges that, at some point, too much cheap becomes expensive. Hadoop 3.0 is borrowing some concepts from the world of enterprise storage. With erasure coding, Hadoop 3.0 adopts RAID techniques for reducing data sprawl. With erasure coding, you can roughly cut the sprawl of storage compared to tradition 3x replication by about 50%. The price of that, of course, is that while replicated data is immediately available, data managed through RAID approaches must be restored - meaning you won't get failover access instantly. So, in practice, the new erasure coding support of Hadoop 3.0 will likely be utilized as a data tiering strategy, where colder (which is usually older) data is stored on cheaper, often slower media.

For Hadoop, these approaches are not new. EMC has offered its own support of HDFS emulation for its Isilon scale-out file storage for several years. The difference with 3.0 is that the goals of this approach - reducing data sprawl - can now be realized through open source technology. But this means that Hadoop adopters must embrace information lifecycle management processes for deprecating colder data; we expect that providers of data archival and information lifecycle management tooling will fill this newly-created vacuum created by the open source world.

So now that we've come up for a way to contain storage sprawl, we'll raise the urgency for it by scaling out YARN. Hadoop 3.0 adds support for federating multiple YARN clusters. Initial design assumptions for YARN pointed to an upward limit of roughly 10,000 nodes; while only Yahoo exceeds that count today, federating YARN allows for a future where more organizations are likely to breach that wall.

Driving all this, of course, is several factors. First, and most obvious, the generation of data is growing exponentially. But secondly, those organizations that got into Hadoop earlier are by now likely to have multiple clusters, either by design or inertia. And in many cases, that proliferation is abetted by the fact that deployments might be sprawled, not only on premises, but also in the cloud. YARN federation is not necessarily about governance per se, but it could open the door in the Hadoop 4.x generation to support for federation across Hadoop's data governance projects, like Atlas, Ranger, or Sentry.

But wait, there's more.

With 3.0, Hadoop is laying the groundwork for managing more heterogeneous clusters. It's not that you couldn't mix and match different kinds of hardware, period. Until, now, the variants tended to revolve adding new generations of nodes that might pack more compute or storage compared to what you would have already had in place; YARN supported that. But with emergence of GPUs and other specialized hardware for new types of workloads (think deep learning), there was no way for YARN to explicitly manage those allocations. In the 3.0 version, the ground work is being laid for managing new hardware types. For GPUs and FPGAs, that will come to fruition in the 3.1 and 3.2 dot releases, respectively.

Hadoop 3.0 also takes the next step in bolstering NameNode reliability. While Hadoop 2.0 supported deployment of two nodes (one active, one standby), now you can have as many standby NameNodes as you want for more robust high availability/failover. Or as one of my colleagues remarked, it's about bloody time that the Apache community added this important feature.

With 3.0 now in release, the Hadoop community is looking to step up the pace of new releases. And so, two dot releases (3.1 and 3.2) are expected in calendar year 2018. As noted above, there will be more delineation for specific resource types managed by YARN. Along with that, a new YARN Services API will simplify the movement of external workloads by encapsulating configurations for security, scheduling, and resource requests. This will help make YARN friendlier to long-running workloads, which has been its weak spot in the past (YARN came out of MapReduce, which was designed for batch jobs).

Published accounts point to expanded support for Docker containers to run with YARN. The operable motion is making Hadoop a tidier deployment format for data analytics-related code. The door has already been cracked open by MapR, with support for persistent containers. While YARN was set up to separate different workloads, containers provides a greater degree of isolation so that programs written in R or Python, for example, will work on the cluster the way they were designed on the laptop.

That prompts the next question: when will Hadoop's forthcoming support for containerized workloads take the inevitable next step in formalizing support of Kubernetes. As the de facto consensus standard for container orchestration, Kubernetes support could expand Hadoop from a target for BI and analytic queries to a platform that runs applications. Yes, there are informal ways to wire Kubernetes and YARN together, but simplifying this with tighter integration could make this natural combination less of a kluge.

With the cloud offering a range of options for running big data jobs with or without Hadoop, the gauntlet has been thrown down to the Apache community. Enable Hadoop to mix and match infrastructure so it can provide the same freedom of choice and flexibility that cloud computing has already made the norm.

Editorial standards