Big Data on Amazon: Elastic MapReduce, step by step

Summary: Curious how to go about doing Hadoop in Amazon's cloud? Here's some guidance.

TOPICS: Big Data

  • Job flow created

    If all goes well, you should see this screen confirming that your EMR job flow has been created.  Click Close so that you can monitor the status of your cluster as it's stood up.

  • Job flows

    The EMR Job Flows screen should display the job flow you just designed.  Confirm the state of the job flow is "STARTING."  An animated orange spinner should appear in the job flow's row, in the leftmost column in the grid.

  • The command line

    Would you rather perform all of the previous steps in one fell swoop?  You can, though a number of preparatory steps are required first.  The Amazon Web Services Elastic MapReduce Command Line Interface (AWS EMR CLI) makes all the previous interactive selections completely scriptable.  Amazon provides complete instructions for downloading the CLI and completing the prerequisites, including creating an AWS account, configuring credentials and setting up a Simple Storage Service (S3) "bucket" for your log files.

    If you're running on Windows, download and install Ruby 1.8.7 (which the EMR CLI relies upon), then download and install the EMR CLI itself.  From a Command window (a.k.a. DOS prompt), you'll be able to navigate to the EMR CLI's installation folder and enter a command like the one shown here, which creates an EMR job flow with Hive, Pig and HBase, based on an m1.large EC2 instance.
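    The exact arguments depend on your configuration, but a job-flow creation command of this kind looked roughly like the following sketch.  The cluster name and instance count here are illustrative, not prescribed by the article; the flags shown (--create, --alive, --hbase, --hive-interactive, --pig-interactive) are from the Ruby-based EMR CLI of that era:

    ```shell
    # Sketch only: run from the EMR CLI's installation folder, with
    # credentials.json already configured per Amazon's setup instructions.
    ruby elastic-mapreduce --create --alive \
      --name "My EMR Cluster" \
      --instance-type m1.large --num-instances 3 \
      --hbase --hive-interactive --pig-interactive
    ```

    The --alive flag keeps the cluster running after startup (rather than terminating when a job completes), which is what lets you connect to it interactively once the job flow reaches the WAITING state.
    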

    If you're clever, you can embed all of this in a Windows batch (.BAT) file and create a shortcut to it on your desktop.  From there, your Hadoop cluster is only a double-click away.
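    A minimal batch-file wrapper might look like the sketch below.  The installation path and the command's arguments are assumptions for illustration; substitute your own:

    ```shell
    @echo off
    REM Hypothetical wrapper -- adjust the path to your EMR CLI folder.
    cd /d C:\elastic-mapreduce-cli

    REM Create the job flow; ^ continues the command across lines in cmd.exe.
    ruby elastic-mapreduce --create --alive --name "My EMR Cluster" ^
      --instance-type m1.large --num-instances 3 ^
      --hbase --hive-interactive --pig-interactive

    REM Keep the window open so you can read the job flow ID it prints.
    pause
    ```

    Saving this as, say, start-emr.bat and pointing a desktop shortcut at it gives you the double-click cluster launch described above.
    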

    Once the job flow is created, proceed to the EC2 Instances screen as you would have were the job flow created interactively...

Andrew Brust

About Andrew Brust

Andrew J. Brust has worked in the software industry for 25 years as a developer, consultant, entrepreneur and CTO, specializing in application development, databases and business intelligence technology.
