There are lots of ways to run Hadoop, but what if you want to start working with it right away, without the distraction of building a cluster yourself? Your best bet is probably a cloud-based Hadoop cluster, and the Elastic MapReduce (EMR) service on Amazon Web Services (AWS) can get you there pretty speedily.
To get an EMR cluster up and running, you'll need to create an AWS account at http://aws.amazon.com, and you'll want to create a security key pair too. There are several other steps of course, and we'll cover them, one by one, in this gallery.
Amazon refers to the process of standing up an EMR cluster as creating a "job flow." You can do this from the command line, using a technique we'll detail later, but you can also do it from your browser. Just navigate to the EMR home page in the AWS console at https://console.aws.amazon.com/elasticmapreduce, and click the Create New Job Flow button at the top left. Doing so will bring up the Create a New Job Flow dialog box (a wizard, essentially), the first screen of which is shown here.
An EMR cluster can use Amazon's own distribution of Hadoop, or MapR's M3 or M5 distribution instead. M5 carries a premium billing rate, as it is not MapR's open source distro.
Those just experimenting with Amazon's Elastic MapReduce can get started immediately by running a sample application, rather than running their own code on their own data. Amazon offers WordCount (the ubiquitous Hadoop sample application) as well as a Hive-based contextual advertising sample, Java- and Pig-based log analysis samples, and CloudBurst, a Java-based sample that analyzes DNA sequencing data.
The Specify Parameters screen allows you to configure backup options for your HBase cluster, and/or to create the new cluster by restoring from an existing backup.
If you just want to play in the sandbox, you can disregard the backup options altogether. Just make sure to select the Hive and Pig checkboxes in the Install Additional Packages section at the bottom of the screen, then click Continue.
In the Configure EC2 Instances screen, you'll need to select an Instance Type for your Master and Core Instance groups. Amazon's "m1.large" instance type is the minimum required for an EMR cluster. If you're creating a cluster just for learning purposes, this will be your least expensive and therefore most sensible option. Select it for both the Master Instance Group and Core Instance Group.
With your instance types selected, you now need to set the number of instances in your Core and Task Instance groups. Again, if you're just putting up a cluster for learning purposes, you will want to minimize the resources you're using, so change the Core Instance Group's Instance Count from its default of 2 to just 1. Leave the Task Instance Group's count at its default of 0, and click Continue.
When you provisioned your AWS account, you should have created at least one EC2 key pair. Pick one for your EMR cluster. Without it, you won't be able to establish a secure terminal session and work interactively with Hadoop. Once you've selected a key pair, click Continue.
You needn't worry about bootstrap actions, so just click Continue through this screen.
In the Review screen, confirm that your instance types, instance counts and key pair configuration are all correct. If not, click the Back link and amend your settings as appropriate. Once everything is correctly configured, click Create Job Flow.
If all goes well, you should see this screen confirming that your EMR job flow has been created. Click Close so that you can monitor the status of your cluster as it's stood up.
The EMR Job Flows screen should display the job flow you just designed. Confirm the state of the job flow is "STARTING." An animated orange spinner should appear in the job flow's row, in the leftmost column in the grid.
Would you rather do all the previous steps in one fell swoop? You can, though a number of preparatory steps are required first. The Amazon Web Services Elastic MapReduce Command Line Interface (AWS EMR CLI) makes all the previous interactive selections completely scriptable. Amazon provides complete instructions for downloading the CLI and completing all prerequisite steps, including creating an AWS account, configuring credentials and setting up a Simple Storage Service (S3) "bucket" for your log files.
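For reference, the CLI reads its settings from a credentials.json file in its installation folder. The field names below follow the format described in Amazon's CLI instructions, and every value is a placeholder you'd replace with your own:

    {
      "access_id": "YOUR_AWS_ACCESS_KEY_ID",
      "private_key": "YOUR_AWS_SECRET_ACCESS_KEY",
      "keypair": "your-ec2-key-pair-name",
      "key-pair-file": "C:\\keys\\your-ec2-key-pair.pem",
      "log_uri": "s3n://your-log-bucket/",
      "region": "us-east-1"
    }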
If you're running on Windows, download and install Ruby 1.8.7 (which the EMR CLI relies upon), then download and install the EMR CLI itself. From a Command window (a.k.a. DOS prompt), you'll be able to navigate to the EMR CLI's installation folder and enter a command like the one shown here, which creates an EMR job flow with Hive, Pig and HBase, based on an m1.large EC2 instance.
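To give a sense of it, a job flow creation command along these lines (the cluster name is arbitrary; the flags are those documented for the Ruby-based EMR CLI) stands up an interactive cluster with Hive, Pig and HBase on a single m1.large instance:

    ruby elastic-mapreduce --create --alive --name "Hadoop sandbox" --hbase --hive-interactive --pig-interactive --instance-type m1.large --num-instances 1

The --alive flag is what keeps the cluster running for interactive use, rather than having it terminate as soon as a batch job completes.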
If you're clever, you can embed all of this in a Windows batch (.BAT) file, and create a shortcut to it on your desktop. From there, your Hadoop cluster is only a double-click away.
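A minimal sketch of such a batch file, assuming the CLI is installed in C:\elastic-mapreduce-cli (adjust the path and cluster name to match your setup):

    @echo off
    rem Stand up an interactive EMR cluster with Hive, Pig and HBase
    cd /d C:\elastic-mapreduce-cli
    ruby elastic-mapreduce --create --alive --name "Hadoop sandbox" --hbase --hive-interactive --pig-interactive --instance-type m1.large --num-instances 1
    pause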
Once the job flow is created, proceed to the EC2 Instances screen as you would have were the job flow created interactively...
Watching the job flow's progress is useful, but you'll need some details about the particular EC2 instance serving as the head node in your cluster. Therefore, click the triangle to the right of the Services menu option, then click on the EC2 option in the resulting drop-down panel.
There's one more step required to get to a status screen for your running EC2 instances: in the EC2 dashboard, click the Instances link along the left nav bar.
In the instances screen, select your instance from the top grid. As soon as you do, details about your instance appear below. One such detail is the instance's Internet host name, which you can select and copy. Once the status of your instance is "running," you're ready to connect to the cluster and start using it.
If you're on Windows, you'll want to download, install and then run PuTTY, the de facto SSH (Secure SHell) client for that OS. Once it's running, paste your instance's host name into the Host Name field in the PuTTY Configuration screen.
Remember that key pair you selected when you configured your job flow? Now you need to point PuTTY at its private key file, in the SSH authentication screen. Select Connection\SSH\Auth from the Category tree view control on the left, then click the Browse button and navigate to the file.
The file will need to be in PPK format, conversion to which can be performed by the PuTTYgen utility that accompanies PuTTY, as described in the EMR CLI instructions. After you've selected the file, click Open to begin your SSH session with your EMR cluster.
Right after you click the Open button, you'll probably see a scary-looking dialog like this one. But have no fear, as it's actually harmless. If you click Yes to add the server's host key to PuTTY's cache, you won't see this message again for this particular job flow.
You're almost there! When you see the "login as:" prompt in PuTTY's terminal window, enter "hadoop" (without the quotes) and tap Enter. That should log you in.
Upon successful login, you should see a welcome screen and be presented with a command prompt. The message telling you how to gain access to the Hadoop "UI" should be taken with a grain of salt, however, as that user interface is presented in Lynx, a text-based Web browser.
Switch to the bin folder (using the "cd bin" command) and list its contents (using the "ls" command). You will see that Hadoop, HBase, Hive and Pig are all neatly installed for you.
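The session should look something like this (the exact contents of the listing will vary with the EMR and Hadoop versions):

    $ cd bin
    $ ls
    hadoop  hbase  hive  pig  ...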
They're ready to run, too. To check this out, enter the "hive" command, and you'll be placed at Hive's command line prompt.
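For example, a quick sanity check from the Hive prompt (the query simply lists tables, of which a fresh cluster will have none):

    $ hive
    hive> SHOW TABLES;
    hive> quit;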
Use the "hbase shell" command to get to the HBase prompt or use the "pig" command to get to the Pig prompt (called "grunt").
You can also use the "hadoop fs" command to perform Hadoop Distributed File System (HDFS) operations and, of course, the "hadoop jar" command to run a Hadoop MapReduce job, as sketched below.
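For instance, you might list HDFS's root directory, copy a local file in, and run the stock WordCount example (the data file is hypothetical, and the examples jar's exact name and location vary by distribution):

    $ hadoop fs -ls /
    $ hadoop fs -mkdir /input
    $ hadoop fs -put mydata.txt /input/
    $ hadoop jar hadoop-examples.jar wordcount /input /output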
When you're all done, don't forget to terminate the instances in your cluster, otherwise you will continue to be billed for them! To terminate the instances, you'll first need to select the Change Termination Protection option in the Actions menu, shown here.
Now click the Yes, Disable button.
Now you're ready to terminate the instance. Select the Terminate option from the Actions menu.
You're one click away now. Just click the Yes, Terminate button to initiate shutdown.
The instances screen will show you that your EC2 instances are now shutting down.
An instance's state will change to terminated once it has been fully shut down and de-provisioned. From this point on, you won't be billed.
You can repeat this entire procedure any time you want to do some Hadoop work. EMR clusters are available on-demand and you can shut them down as soon as you're done using them.
For a more permanent cloud-based cluster, you can provision raw EC2 instances, installing Hadoop on them yourself and federating them into a cluster.
Cloud-based, on-demand Hadoop cluster services comparable to AWS EMR will soon be available from Microsoft, Google and Rackspace, leaving fodder for future galleries here on ZDNet.