Hadoop in the Cloud with Amazon, Google and MapR

June was a big month for the cloud and for Hadoop. It was also a big month for the intersection of the two.

June of 2012 was a big month for the cloud. On June 7th, Microsoft re-launched its Windows Azure cloud platform, which now features a full-fledged Infrastructure as a Service (IaaS) offering that accommodates both Windows and Linux. Then on June 28th, at its I/O conference, Google launched Google Compute Engine, its own IaaS cloud platform.

June was also a big month for Hadoop. The Hadoop Summit took place in San Jose, CA on June 13th and 14th. Yahoo offshoot Hortonworks used the event as the launchpad for its own Hadoop distribution, dubbed the Hortonworks Data Platform (HDP). Hortonworks will now work to achieve the same prominence for HDP as Cloudera has achieved for Cloudera's Distribution Including Apache Hadoop (CDH).

Don't forget MapR
The Hadoop world isn't all about CDH and HDP though. Another important distribution out there is the one from MapR, which seeks to make the Hadoop Distributed File System (HDFS) friendlier by making it addressable as a Network File System (NFS) volume. This HDFS-to-NFS translation gives HDFS files the random read/write access taken for granted with conventional file systems.
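To make "random read/write access" concrete, here is a minimal Python sketch of what an NFS-style mount allows: ordinary file calls that seek to an offset and overwrite bytes in place. It is an illustration under stated assumptions, not MapR's API. The helper names write_at and read_at are made up for this example, and a local temporary file stands in for a file on a mounted cluster volume so the snippet runs anywhere.

```python
import os
import tempfile


def write_at(path, offset, data):
    """Overwrite bytes in place at the given offset (hypothetical helper)."""
    # "r+b" opens an existing file for reading and writing without truncating it.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def read_at(path, offset, length):
    """Read `length` bytes starting at `offset` (hypothetical helper)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Assumption: a local temp file stands in for a file on an NFS-mounted volume.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"0123456789")

write_at(demo_path, 4, b"XY")        # patch two bytes in the middle of the file
patched = read_at(demo_path, 0, 10)  # whole file after the patch
middle = read_at(demo_path, 3, 4)    # a slice read from an arbitrary offset

os.remove(demo_path)
```

On a volume exposed over NFS, the same calls would simply be pointed at a path under the mount point; nothing in the code would need to know that Hadoop is on the other end.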

MapR's distribution trails Cloudera's by a wide margin in adoption, however, and Hortonworks' distro will probably overtake it quickly as well. But don't write MapR off just yet, because in the last couple of weeks it has emerged as an important piece of the Hadoop cloud puzzle.

MapR-ing the cloud
On June 16th, Amazon announced that in addition to its own Hadoop distribution, it now provides the option to use MapR's distro on the temporary clusters provisioned through its Elastic MapReduce service, which runs on its Elastic Compute Cloud (EC2). Customers can use the "M3" open source community edition of MapR or the Enterprise edition, known as "M5." M5 carries a non-trivial surcharge but offers Enterprise features like mirrored clusters and the ability to create snapshot-style backups of your cluster.

MapR didn't stop there. It continued its June cloud crusade by announcing, on June 28th at Google I/O, a private beta of its Hadoop distribution running on the Google Compute Engine cloud. Suddenly the Hadoop distro that many considered an also-ran has become the poster child for Big Data in the cloud.

Cloud Hadoop by Microsoft, or even by yourself
Is that enough Hadoop in the cloud for you? If not, don't forget the Microsoft-Hortonworks Hadoop distro for Windows, now available through an invitation-only beta on Microsoft's Azure cloud. And if you're still not satisfied, check out the Apache Whirr project, which lets you deploy virtually any Hadoop distribution to clusters built from cloud servers on Amazon Web Services or Rackspace Cloud Servers.

Hadoop in the cloud isn't always easy, especially since most cloud platforms have their own BLOB storage facilities that typically only a distribution tuned for that cloud vendor can handle. But Hadoop in the cloud makes a great deal of sense: the elastic resource allocation that cloud computing is premised on works well for cluster-based data processing infrastructure, which is used for varying analyses on data sets of indeterminate size.
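A rough back-of-the-envelope sketch shows why elasticity matters here. Everything in it is an assumption made for illustration: the per-node throughput, the deadline, the data set sizes, and the idea that throughput scales linearly with node count. None of the figures come from a vendor or a benchmark.

```python
import math


def nodes_needed(data_gb, gb_per_node_hour, deadline_hours):
    """Smallest cluster that finishes in time, assuming linear scaling (hypothetical model)."""
    work_per_node = gb_per_node_hour * deadline_hours  # GB one node handles by the deadline
    return max(1, math.ceil(data_gb / work_per_node))


# Hypothetical jobs of very different sizes, each due within the same 2 hours,
# on nodes assumed to process 25 GB per hour each.
job_sizes_gb = [50, 400, 3000]
cluster_sizes = [nodes_needed(size, 25, 2) for size in job_sizes_gb]

for size, nodes in zip(job_sizes_gb, cluster_sizes):
    print(f"{size} GB -> {nodes} node(s)")
```

Under these made-up numbers, the same two-hour deadline calls for anywhere from 1 to 60 nodes depending on the job. A fixed on-premises cluster would have to be sized for the largest case; an elastic one can be provisioned per job and released afterward.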

Hadoop in the cloud is likely to become increasingly popular. It will be interesting to see whether the Hadoop distribution vendors or the cloud platform vendors end up leading the charge.