Big Data on Amazon: Elastic MapReduce, step by step

Curious how to go about doing Hadoop in Amazon's cloud? Here's some guidance.
By Andrew Brust, Contributor on
1 of 29 Andrew Brust/ZDNet

Hadoop on Amazon's Cloud

There are lots of ways to run Hadoop, but what if you want to start working with it right away, without the distraction of building a cluster yourself?  Your best bet is probably a cloud-based Hadoop cluster, and the Elastic MapReduce (EMR) service on Amazon Web Services (AWS) can get you there pretty speedily.

To get an EMR cluster up and running, you'll need to create an AWS account at http://aws.amazon.com, and you'll want to create a security key pair too.  There are several other steps of course, and we'll cover them, one by one, in this gallery.

Pick a distro

Amazon refers to the process of standing up an EMR cluster as creating a "job flow."  You can do this from the command line, using a technique we'll detail later, but you can also do it from your browser.  Just navigate to the EMR home page in the AWS console at https://console.aws.amazon.com/elasticmapreduce, and click the Create New Job Flow button at the top left.  Doing so will bring up the Create a New Job Flow dialog box (a wizard, essentially), the first screen of which is shown here.

An EMR cluster can use Amazon's own distribution of Hadoop, or MapR's M3 or M5 distribution instead.  M5 carries a premium billing rate, as it is not MapR's open source distro.

Sample applications

Those just experimenting with Amazon's Elastic MapReduce can get started immediately by running a sample application, rather than running their own code on their own data.  Amazon offers WordCount (the ubiquitous Hadoop sample application) as well as a Hive-based contextual advertising sample, Java and Pig-based log analysis samples and another Java-based sample that looks at data from Amazon's CloudBurst service.

Run your own app

If you need to do production work, or just want to conduct a more free-form Hadoop experiment, you'll want to select the option to run your own application.  Picking HBase and clicking Continue is best, as this lets you add Hive and Pig as well.

Specify Parameters

The Specify Parameters screen allows you to configure backup options for your HBase cluster, and/or to create the new cluster by restoring from an existing backup.

If you just want to play in the sandbox though, you can disregard the backup options, but make sure to select the Hive and Pig checkboxes in the Install Additional Packages section at the bottom of the screen, then click Continue.

Configure EC2 instances

In the Configure EC2 Instances screen, you'll need to select an Instance Type for your Master and Core Instance groups.  Amazon's "m1.large" instance type is the minimum required for an EMR cluster.  If you're creating a cluster just for learning purposes, this will be your least expensive and therefore most sensible option.  Select it for both the Master Instance Group and Core Instance Group.

Instance counts

With your instance types selected, you now need to set the number of instances in your Core and Task Instance groups.  Again, if you're just putting up a cluster for learning purposes, you will want to minimize the resources you're using, so change the Core Instance Group's Instance Count from the default setting of 2 to just 1.  Leave the Task Instance Group's Instance Count at its default of 0, and click Continue.

Advanced options

When you provisioned your AWS account, you should have created at least one EC2 key pair.  Pick one for your EMR cluster.  Without it, you won't be able to establish a secure terminal session and work interactively with Hadoop.  Once you've selected a key pair, click Continue.

Bootstrap actions

You needn't worry about bootstrap actions, so just click Continue through this screen.


In the Review screen, confirm that your instance types, instance counts and key pair configuration are all correct.  If not, click the Back link and amend your settings as appropriate.  Once everything is correctly configured, click Create Job Flow.

Job flow created

If all goes well, you should see this screen confirming that your EMR job flow has been created.  Click Close so that you can monitor the status of your cluster as it's stood up.

Job flows

The EMR Job Flows screen should display the job flow you just designed.  Confirm the state of the job flow is "STARTING."  An animated orange spinner should appear in the job flow's row, in the leftmost column in the grid.
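If you'd rather script that status check than watch the spinner, here is a minimal Python sketch.  It assumes the boto3 library, which this gallery does not use, and its EMR client's describe_cluster call, which takes the job flow ID as a cluster ID.  The function name, the 30-second polling interval and the way the client is passed in are illustrative choices, not anything Amazon prescribes.

```python
import time


def wait_until_ready(emr_client, cluster_id, poll_seconds=30, sleep=time.sleep):
    """Poll the cluster's state until it has left its start-up states.

    emr_client is assumed to be a boto3 EMR client (boto3.client("emr"));
    cluster_id is the job flow ID shown in the console.
    """
    while True:
        description = emr_client.describe_cluster(ClusterId=cluster_id)
        state = description["Cluster"]["Status"]["State"]
        # STARTING and BOOTSTRAPPING both mean the cluster is still coming up.
        if state not in ("STARTING", "BOOTSTRAPPING"):
            return state
        sleep(poll_seconds)
```

The function returns whatever state follows start-up, so the caller should still check that it is a healthy one (such as WAITING or RUNNING) rather than a terminated state.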

The command line

Would you rather do all the previous steps in one fell swoop?  While there are a number of preparatory steps required, you can.  The Amazon Web Services Elastic MapReduce Command Line Interface (AWS EMR CLI) makes all the previous interactive selections completely scriptable.  Amazon provides complete instructions for downloading the CLI and completing all prerequisite steps, including creating an AWS account, configuring credentials and setting up a Simple Storage Service (S3) "bucket" for your log files.

If you're running on Windows, download and install Ruby 1.8.7 (which the EMR CLI relies upon), then download and install the EMR CLI itself.  From a Command window (a.k.a. DOS prompt), you'll be able to navigate to the EMR CLI's installation folder and enter a command like the one shown here, which creates an EMR job flow with Hive, Pig and HBase, based on an m1.large EC2 instance.

If you're clever, you can embed all of this in a Windows batch (.BAT) file, and create a shortcut to it on your desktop.  From there, your Hadoop cluster is only a double-click away.
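The same idea can be sketched in Python.  This is not the Ruby-based EMR CLI described above; it assumes the boto3 library and its EMR client's run_job_flow call instead.  The function name, the default role names and the choice to enable termination protection are assumptions made for illustration, and the release label is left to the caller because not every EMR release supports every instance type or application.

```python
def build_job_flow_request(name, key_name, log_uri, release_label,
                           instance_type="m1.large", core_count=1):
    """Return keyword arguments for a boto3 EMR run_job_flow call.

    Mirrors the wizard choices in this gallery: Hive, Pig and HBase,
    one master and one core instance, and an EC2 key pair for SSH.
    """
    return {
        "Name": name,
        "LogUri": log_uri,               # S3 location for log files
        "ReleaseLabel": release_label,   # supplied by the caller, not guessed
        "Applications": [{"Name": app} for app in ("Hive", "Pig", "HBase")],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
            "Ec2KeyName": key_name,
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for interactive work
            "TerminationProtected": True,         # assumption: matches the protected cluster shown later
        },
        # Assumed default role names; your account may use different ones.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   emr = boto3.client("emr")
#   response = emr.run_job_flow(**build_job_flow_request(
#       "sandbox", "my-key-pair", "s3://my-bucket/logs/", "<a release label>"))
#   job_flow_id = response["JobFlowId"]
```

Keeping the request in a plain function makes it easy to review the settings before anything billable is created.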

Once the job flow is created, proceed to the EC2 Instances screen as you would have were the job flow created interactively...

Go to EC2 instances screen

Watching the job flow's progress is useful, but you'll need some details about the particular EC2 instance serving as the head node in your cluster.  Therefore, click the triangle to the right of the Services menu option, then click on the EC2 option in the resulting drop-down panel.

Select instances

There's one more step required to get to a status screen for your running EC2 instances: in the EC2 dashboard, click the Instances link along the left nav bar.

Instances screen

In the instances screen, select your instance from the top grid.  As soon as you do, details about your instance appear below.  One such detail is the instance's Internet host name, which you can select and copy.  Once the status of your instance is "running," you're ready to connect to the cluster and start using it.
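The head node's host name can also be looked up programmatically.  The sketch below assumes the boto3 library (not used in this gallery) and reads the MasterPublicDnsName field from its EMR client's describe_cluster response; the function name is made up for illustration.

```python
def master_host_name(emr_client, cluster_id):
    """Return the head node's public host name, or None if not yet assigned.

    emr_client is assumed to be a boto3 EMR client (boto3.client("emr")).
    """
    cluster = emr_client.describe_cluster(ClusterId=cluster_id)["Cluster"]
    # The field is absent while the cluster is still being provisioned.
    return cluster.get("MasterPublicDnsName")
```

A result of None simply means the instance isn't far enough along yet, so try again once its status is "running."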

Enter host name

If you're on Windows, you'll want to download, install and then run PuTTY, the de facto SSH (Secure SHell) client for that OS.  Once it's running, paste your instance's host name into the Host Name field in the PuTTY Configuration screen.

Private key file

Remember that key pair you selected when you configured your job flow?  Now you need to select its private key file in PuTTY's SSH authentication screen.  Select Connection\SSH\Auth from the Category tree view control on the left, then click the Browse button and navigate to the file.

The file will need to be in PPK format, conversion to which can be performed by the PuTTYgen utility that accompanies PuTTY, as described in the EMR CLI instructions.  After you've selected the file, click Open to begin your SSH session with your EMR cluster.

Add host key

Right after you click the Open button, you'll probably see a scary-looking dialog like this one.  But have no fear, as it's actually harmless.  If you click Yes to add the server's host key to PuTTY's cache, you won't see this message again for this particular job flow.

Log in!

You're almost there!  When you see the "login as:" prompt in PuTTY's terminal window, enter "hadoop" (without the quotes) and tap Enter.  That should log you in.
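If you're not on Windows, the PuTTY steps above map onto a single OpenSSH command.  Here's a small Python sketch that assembles it, assuming the standard "ssh" client is installed and that you use the key pair's original PEM file rather than the PPK conversion.  The function name, host name and file name are placeholders.

```python
import subprocess


def ssh_command(host_name, private_key_path, user="hadoop"):
    """Build an OpenSSH command line for logging in to the head node."""
    # -i names the private key file; "hadoop" is the login used in this gallery.
    return ["ssh", "-i", private_key_path, f"{user}@{host_name}"]


# Usage (opens an interactive session in the current terminal):
#   subprocess.run(ssh_command("<your instance's host name>", "my-key-pair.pem"))
```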

Welcome to your cluster

Upon successful login, you should see a welcome screen and be presented with a command prompt.  The message telling you how to gain access to the Hadoop "UI" should be taken with a grain of salt, however, as that user interface is presented in Lynx, a text-based Web browser.

The bin folder

Switch to the bin folder (using the "cd bin" command) and list its contents (using the "ls" command).  You will see that Hadoop, HBase, Hive and Pig are all neatly installed for you.

They're ready to run, too.  To check this out, enter the "hive" command, and you'll be placed at Hive's command-line prompt.

HBase prompt and "grunt"

Use the "hbase shell" command to get to the HBase prompt or use the "pig" command to get to the Pig prompt (called "grunt").

Although not shown here, you can also use the "hadoop fs" command to perform Hadoop Distributed File System (HDFS) operations and, of course, the "hadoop jar" command to run a Hadoop MapReduce job.
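Those two commands are easy to drive from a script running on the head node.  The following Python sketch only wraps "hadoop fs" and "hadoop jar" as described above; the helper names, the jar file and the paths in the usage notes are hypothetical.

```python
import subprocess


def hdfs_command(*args):
    """Build a 'hadoop fs' command line, e.g. hdfs_command("-ls", "/")."""
    return ["hadoop", "fs", *args]


def mapreduce_command(jar_path, *job_args):
    """Build a 'hadoop jar' command line that runs a MapReduce job."""
    return ["hadoop", "jar", jar_path, *job_args]


def run(command):
    """Run a command and return its standard output; raises if it fails."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


# Usage on the cluster (file names and paths are placeholders):
#   run(hdfs_command("-put", "input.txt", "/input/input.txt"))
#   run(mapreduce_command("my-job.jar", "/input", "/output"))
#   print(run(hdfs_command("-ls", "/output")))
```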

Change termination protection

When you're all done, don't forget to terminate the instances in your cluster; otherwise, you will continue to be billed for them!  To terminate the instances, you'll first need to select the Change Termination Protection option in the Actions menu, shown here.

Disable termination protection

Now click the Yes, Disable button.

Terminate your instance

Now you're ready to terminate the instance.  Select the Terminate option from the Actions menu.

Yes, terminate

You're one click away now.  Just click the Yes, Terminate button to initiate shutdown.

Shutting down

The instances screen will show you that each of your cluster's EC2 instances is now shutting down.


An instance's state will change to terminated once it has been fully shut down and de-provisioned.  From this point on, you won't be billed.
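Since forgetting this step costs money, it may be worth scripting.  The sketch below is one way to do it, assuming the boto3 library (not used in this gallery) and its EMR client's set_termination_protection and terminate_job_flows calls.  It works at the job flow level rather than instance by instance as in the console steps above, and the function name is illustrative.

```python
def shut_down_cluster(emr_client, cluster_id):
    """Disable termination protection, then terminate the whole job flow.

    emr_client is assumed to be a boto3 EMR client (boto3.client("emr")).
    """
    # Termination requests are refused while protection is enabled.
    emr_client.set_termination_protection(
        JobFlowIds=[cluster_id], TerminationProtected=False)
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
```

It's still worth confirming in the console afterwards that the instances have reached the terminated state.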

You can repeat this entire procedure any time you want to do some Hadoop work.  EMR clusters are available on-demand and you can shut them down as soon as you're done using them.

For a more permanent cloud-based cluster, you can provision raw EC2 instances, installing Hadoop on them yourself and federating them into a cluster.

Cloud-based, on-demand Hadoop cluster services comparable to AWS EMR will soon be available from Microsoft, Google and Rackspace, leaving fodder for future galleries here on ZDNet.
