Hadoop, or formally Apache Hadoop, is a popular software framework for building clusters of computers to process large data sets, known familiarly as big data. It is an open-source project that has gathered support from Yahoo! (where it began as a project in 2005), Cloudera, Hortonworks, Twitter, LinkedIn, VMware, Intel, and others.
The question today is, "Should you virtualize your Hadoop cluster?" VMware and others have performed benchmarks to try to answer this question. Spoiler alert: The answer is a surprising "Yes".
If you're still pondering Hadoop for your business and you don't quite know where it fits, here is the description of the project from the Apache Hadoop website:
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
You might still be in the "kicking the tires" phase of your look at Hadoop and that's fine. This post isn't an introductory look at Hadoop; there are plenty of those available to you. This post focuses on the results of a VMware benchmark study performed on VMware vSphere 5.1.
[UPDATE, July 21, 2015] VMware just sent me an updated study using vSphere 6. It bears out some of my predictions and tests higher numbers of VMs per host.
You can read the hardware specifications yourself at the end of the benchmark study, but in case you don't, here is a general overview of what was used.
32 HP DL380 G7 servers for the host cluster, configured as follows:
- HP disk arrays equipped with Seagate ST9146852SS drives (SAS 6 Gb/s, 146 GB, 15,000 RPM, 16 MB buffer)
- RHEL 6.1 x86_64
- Virtual hardware: version 9
- VMware Tools: installed
- Disks: physical raw device mappings (RDMs)

Three VM configurations were tested:
- 1 VM per host: 69,632 MB RAM (68,000 MB reserved), 16 vCPUs, 12 data disks; no pinning, virtual NUMA, numa.memory.gransize = 1024
- 2 VMs per host: 34,816 MB RAM (34,000 MB reserved), 8 vCPUs, 6 data disks; 1 VM pinned to each NUMA node
- 4 VMs per host: 17,408 MB RAM (17,000 MB reserved), 4 vCPUs, 3 data disks; 2 VMs pinned to each NUMA node
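To make the pinning concrete, here is a sketch of what the 2-VMs-per-host configuration might look like as .vmx settings. The memory, reservation, and vCPU values come from the study's configuration above; the parameter names are standard vSphere options, not reproduced from the study's actual files, so treat this as illustrative only.

```ini
# Illustrative .vmx fragment for one of the two VMs on a host
# (2-VMs-per-host configuration; sizing values from the study).
memsize = "34816"            # 34,816 MB of RAM
sched.mem.min = "34000"      # 34,000 MB reserved
numvcpus = "8"               # 8 vCPUs
numa.nodeAffinity = "0"      # pin this VM to NUMA node 0; the second VM
                             # on the host would use "1"
```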
VMware also provides you with a list of best practices for setting up a virtual Hadoop cluster. Creating such a cluster requires some advanced architecture design for VM sizing, kernel parameter tweaking, host configuration optimization, and some adaptability on your part. In other words, creating a high-performance Hadoop cluster is not for the average VMware administrator.
If you examine the hardware specs closely, you'll note that these systems are not HP's latest technology nor are they the maxed-out, souped-up megamachines you would toss at a project like this. In virtualization terms, these would be considered average host systems.
The results are surprising and hopeful for the future of big data virtualization. The following graph shows three benchmark results compared to native performance, and it also compares the three configurations against one another: 1 VM per host, 2 VMs per host, and 4 VMs per host.
As the graph shows, with native performance normalized to 1.0, every configuration performs close to native or better.
You might be asking yourself, "Why virtualize a single VM per host?" My answer is that it provides a baseline for the study. Ordinarily no one would build a virtual host that hosts only a single VM. However, these results suggest that performance can actually be better in a VM than on native hardware. Surprising, but true.
For my money, I'd go with more VMs per host regardless of what the numbers tell me, because more VMs per host is better leverage for my hardware dollars. For maximum bang for the buck, four VMs per host is very exciting. I wish the study had gone to eight VMs per host to observe the differences in performance at that level; my guess is that the VMs would have to be downsized quite a bit to accommodate the added VMs. At four VMs per host, you've multiplied your big data power into a 128-node cluster.
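The sizing arithmetic here is straightforward: each host's resources are divided evenly among its VMs, and the cluster's node count is hosts times VMs per host. A quick sketch using the study's own numbers (the `vm_sizing` helper is mine, not from the study):

```python
# Per-host resources from the study's 1-VM-per-host configuration.
HOSTS = 32
TOTAL_MB = 69632        # memory given to the VM(s) on one host
TOTAL_VCPUS = 16
TOTAL_DATA_DISKS = 12

def vm_sizing(vms_per_host):
    """Evenly partition one host's resources across its VMs."""
    return {
        "memory_mb": TOTAL_MB // vms_per_host,
        "vcpus": TOTAL_VCPUS // vms_per_host,
        "data_disks": TOTAL_DATA_DISKS // vms_per_host,
        "cluster_nodes": HOSTS * vms_per_host,
    }

for n in (1, 2, 4, 8):
    s = vm_sizing(n)
    print(f"{n} VM(s)/host: {s['memory_mb']} MB, {s['vcpus']} vCPUs, "
          f"{s['data_disks']} data disks, {s['cluster_nodes']} nodes")
```

Note that the 2- and 4-VM rows reproduce the study's 34,816 MB and 17,408 MB figures exactly, and the hypothetical 8-VM row lands at 256 nodes, which is where the speculation below comes from.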
The data tells me that virtualizing Hadoop can be done and it can be done in such a way that you can efficiently multiply your computing power without sacrificing performance.
And I think that if you deployed G8 systems with more processors, more RAM, and SSDs, you would blow this benchmark out of the water and could run at least eight VMs per host at native performance or better. 256 Hadoop nodes running on 32 systems is a huge amount of computing power for big data sets.
So if someone asks you your opinion about virtualizing Hadoop, you can tell them that not only is it possible, but it's a really good thing to do.