
Big Data on Amazon: Elastic MapReduce, step by step

  • Hadoop on Amazon's Cloud

    There are lots of ways to run Hadoop, but what if you want to start working with it right away, without the distraction of building a cluster yourself?  Your best bet is probably a cloud-based Hadoop cluster, and the Elastic MapReduce (EMR) service on Amazon Web Services (AWS) can get you there pretty speedily.

    To get an EMR cluster up and running, you'll need to create an AWS account at http://aws.amazon.com, and you'll want to create a security key pair too.  There are several other steps of course, and we'll cover them, one by one, in this gallery.
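    If you prefer the terminal, a key pair can also be generated locally and imported into EC2 (Key Pairs > Import Key Pair in the EC2 console). A minimal sketch; the file name "emr-key" is illustrative, not anything the article uses:

```shell
# Generate a 2048-bit RSA key pair locally; the public half can then be
# imported into EC2 as a named key pair. "emr-key" is an illustrative name.
ssh-keygen -t rsa -b 2048 -f emr-key -N "" -q
ls emr-key emr-key.pub
```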

    Published: January 7, 2013 -- 14:30 GMT (06:30 PST)

    Caption by: Andrew Brust

  • Pick a distro

    Amazon refers to the process of standing up an EMR cluster as creating a "job flow."  You can do this from the command line, using a technique we'll detail later, but you can also do it from your browser.  Just navigate to the EMR home page in the AWS console at https://console.aws.amazon.com/elasticmapreduce, and click the Create New Job Flow button at the top left.  Doing so will bring up the Create a New Job Flow dialog box (a wizard, essentially), the first screen of which is shown here.

    An EMR cluster can use Amazon's own distribution of Hadoop, or MapR's M3 or M5 distribution instead.  M5 carries a premium billing rate, as it is not MapR's open source distro.

  • Sample applications

    Those just experimenting with Amazon's Elastic MapReduce can get started immediately by running a sample application, rather than running their own code on their own data.  Amazon offers WordCount (the ubiquitous Hadoop sample application) as well as a Hive-based contextual advertising sample, Java- and Pig-based log analysis samples and CloudBurst, a Java-based sample that performs DNA sequence analysis.

  • Run your own app

    If you need to do production work, or just want to conduct a more free-form Hadoop experiment, you'll want to select the option to run your own application.  Picking HBase and clicking Continue is best, as this lets you add Hive and Pig as well.

  • Specify Parameters

    The Specify Parameters screen allows you to configure backup options for your HBase cluster, and/or to create the new cluster by restoring from an existing backup.

    If you just want to play in the sandbox though, you can disregard the backup options, but make sure to select the Hive and Pig checkboxes in the Install Additional Packages section at the bottom of the screen, then click Continue.

  • Configure EC2 instances

    In the Configure EC2 Instances screen, you'll need to select an Instance Type for your Master and Core Instance groups.  Amazon's "m1.large" instance type is the minimum required for an EMR cluster.  If you're creating a cluster just for learning purposes, this will be your least expensive and therefore most sensible option.  Select it for both the Master Instance Group and Core Instance Group.

  • Instance counts

    With your instance types selected, you now need to set the number of instances in your Core and Task Instance groups.  Again, if you're just putting up a cluster for learning purposes, you will want to minimize the resources you're using, so change the Core Instance Group's Instance Count from the default of 2 to just 1.  Leave the Task Instance Group's count at its default of 0, and click Continue.
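    Summed up, the minimal learning cluster from the settings above comes to two billable m1.large instances; a trivial sketch:

```shell
# Instance counts from the wizard settings above: 1 master node,
# 1 core node, 0 task nodes.
MASTER=1
CORE=1
TASK=0
echo "total m1.large instances: $((MASTER + CORE + TASK))" > cluster-size.txt
cat cluster-size.txt
```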

  • Advanced options

    When you provisioned your AWS account, you should have created at least one EC2 key pair.  Pick one for your EMR cluster.  Without it, you won't be able to establish a secure terminal session and work interactively with Hadoop.  Once you've selected a key pair, click Continue.

  • Bootstrap actions

    You needn't worry about bootstrap actions, so just click Continue through this screen.

  • Review

    In the Review screen, confirm that your instance types, instance counts and key pair configuration are all correct.  If not, click the Back link and amend your settings as appropriate.  Once everything is correctly configured, click Create Job Flow.

  • Job flow created

    If all goes well, you should see this screen confirming that your EMR job flow has been created.  Click Close so that you can monitor the status of your cluster as it's stood up.

  • Job flows

    The EMR Job Flows screen should display the job flow you just designed.  Confirm the state of the job flow is "STARTING."  An animated orange spinner should appear in the job flow's row, in the leftmost column in the grid.
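    The same state can be checked from a terminal with Amazon's Ruby-based EMR CLI. A sketch only, written to a file here rather than executed, since the tool requires AWS credentials:

```shell
# Sketch: listing active job flows with Amazon's Ruby EMR CLI.
# Not executed here; it needs a configured credentials.json.
cat > list-jobflows.txt <<'EOF'
ruby elastic-mapreduce --list --active
EOF
cat list-jobflows.txt
```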

  • The command line

    Would you rather do all the previous steps in one fell swoop?  While there are a number of preparatory steps required, you can.  The Amazon Web Services Elastic MapReduce Command Line Interface (AWS EMR CLI) makes all the previous interactive selections completely scriptable.  Amazon provides complete instructions for downloading the CLI and completing all prerequisite steps, including creating an AWS account, configuring credentials and setting up a Simple Storage Service (S3) "bucket" for your log files.

    If you're running on Windows, download and install Ruby 1.8.7 (which the EMR CLI relies upon), then download and install the EMR CLI itself.  From a Command window (a.k.a. DOS prompt), you'll be able to navigate to the EMR CLI's installation folder and enter a command like the one shown here, which creates an EMR job flow with Hive, Pig and HBase, based on an m1.large EC2 instance.
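    Since the screenshot isn't reproduced here, the command below is a hedged reconstruction of such an invocation, matching the wizard choices made earlier (HBase plus Hive and Pig on a single m1.large node); the job flow name is illustrative:

```shell
# Reconstruction (not the exact screenshot command) of an EMR CLI call
# mirroring the wizard selections. Written to a file rather than run,
# since it requires AWS credentials. "HBase sandbox" is an illustrative name.
cat > create-jobflow.txt <<'EOF'
ruby elastic-mapreduce --create --alive --name "HBase sandbox" \
  --hbase --hive-interactive --pig-interactive \
  --instance-type m1.large --num-instances 1
EOF
cat create-jobflow.txt
```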

    If you're clever, you can embed all of this in a Windows batch (.BAT) file, and create a shortcut to it on your desktop.  From there, your Hadoop cluster is only a double-click away.

    Once the job flow is created, proceed to the EC2 Instances screen as you would have were the job flow created interactively...

  • Go to EC2 instances screen

    Watching the job flow's progress is useful, but you'll need some details about the particular EC2 instance serving as the head node in your cluster.  Therefore, click the triangle to the right of the Services menu option, then click the EC2 option in the resulting drop-down panel.

  • Select instances

    There's one more step required to get to a status screen for your running EC2 instances: in the EC2 dashboard, click the Instances link in the left nav bar.

  • Instances screen

    In the instances screen, select your instance from the top grid.  As soon as you do, details about your instance appear below.  One such detail is the instance's Internet host name, which you can select and copy.  Once the status of your instance is "running," you're ready to connect to the cluster and start using it.

  • Enter host name

    If you're on Windows, you'll want to download, install and then run PuTTY, the de facto SSH (Secure SHell) client for that OS.  Once it's running, paste your instance's host name into the Host Name field in the PuTTY Configuration screen.
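    On Linux or macOS, PuTTY isn't needed; OpenSSH does the same job. The equivalent connection command is sketched below, with the host name and key file as placeholders for your own values:

```shell
# OpenSSH equivalent of the PuTTY setup. The host name and key file are
# placeholders; substitute the values from your own EC2 console.
HOST="ec2-0-0-0-0.compute-1.amazonaws.com"
echo "ssh -i emr-key.pem hadoop@$HOST" > ssh-command.txt
cat ssh-command.txt
```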

  • Private key file

    Remember that key pair you selected when you configured your job flow?  Now you need to select its private key file in PuTTY's SSH authentication screen.  Select Connection\SSH\Auth in the Category tree view control on the left, then click the Browse button and navigate to the file.

    The file will need to be in PPK format, conversion to which can be performed by the PuTTYgen utility that accompanies PuTTY, as described in the EMR CLI instructions.  After you've selected the file, click Open to begin your SSH session with your EMR cluster.
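    On Unix-like systems the same conversion can be scripted with the puttygen command-line tool (from the "putty-tools" package). A sketch, shown as text since the .pem file here is a placeholder:

```shell
# Command-line form of the PuTTYgen GUI conversion. File names are
# placeholders; not executed here.
cat > ppk-convert.txt <<'EOF'
puttygen emr-key.pem -O private -o emr-key.ppk
EOF
cat ppk-convert.txt
```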

  • Add host key

    Right after you click the Open button, you'll probably see a scary-looking dialog like this one.  But have no fear, as it's actually harmless.  If you click Yes to add the server's host key to PuTTY's cache, you won't see this message again for this particular job flow.

  • Log in!

    You're almost there!  When you see the "login as:" prompt in PuTTY's terminal window, enter "hadoop" (without the quotes) and tap Enter.  That should log you in.

  • Welcome to your cluster

    Upon successful login, you should see a welcome screen and be presented with a command prompt.  The message telling you how to gain access to the Hadoop "UI" should be taken with a grain of salt, however, as that user interface is presented in Lynx, a text-based Web browser.

  • The bin folder

    Switch to the bin folder (using the "cd bin" command) and list its contents (using the "ls" command).  You will see that Hadoop, HBase, Hive and Pig are all neatly installed for you.

    They're ready to run, too.  To check this out, enter the "hive" command, and you'll be placed at Hive's command line prompt.
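    A few statements to try once you reach the hive> prompt; illustrative only, with "pokes" being the table name from Hive's own Getting Started examples:

```shell
# Illustrative HiveQL to paste at the hive> prompt. Written to a file
# here, since no cluster is attached.
cat > first-hive.sql <<'EOF'
CREATE TABLE pokes (foo INT, bar STRING);
SHOW TABLES;
DESCRIBE pokes;
EOF
cat first-hive.sql
```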

  • HBase prompt and "grunt"

    Use the "hbase shell" command to get to the HBase prompt or use the "pig" command to get to the Pig prompt (called "grunt").

    Although not shown here, you can also use the "hadoop fs" command to perform Hadoop Distributed File System (HDFS) operations and, of course, the "hadoop jar" command to run a Hadoop MapReduce job.
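    For instance (paths and the jar name are illustrative, not taken from the article):

```shell
# Typical HDFS and job-submission commands on the cluster. Paths and
# the examples jar name are illustrative. Written to a file here,
# since no cluster is attached.
cat > hdfs-examples.txt <<'EOF'
hadoop fs -ls /
hadoop fs -mkdir /user/hadoop/input
hadoop fs -put localdata.txt /user/hadoop/input/
hadoop jar hadoop-examples.jar wordcount /user/hadoop/input /user/hadoop/output
EOF
cat hdfs-examples.txt
```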

  • Change termination protection

    When you're all done, don't forget to terminate the instances in your cluster, otherwise you will continue to be billed for them!  To terminate the instances, you'll first need to select the Change Termination Protection option in the Actions menu, shown here.
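    The teardown can also be scripted with the EMR CLI of that era. A sketch only: the flag names are recalled from the CLI's documentation and the job flow ID is a placeholder, so verify against the tool's --help before relying on it:

```shell
# CLI equivalent of the console teardown (sketch; flag names should be
# verified against the EMR CLI's --help). "j-EXAMPLE123" is a placeholder.
cat > terminate.txt <<'EOF'
ruby elastic-mapreduce --set-termination-protection false --jobflow j-EXAMPLE123
ruby elastic-mapreduce --terminate --jobflow j-EXAMPLE123
EOF
cat terminate.txt
```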

  • Disable termination protection

    Now click the Yes, Disable button.

  • Terminate your instance

    Now you're ready to terminate the instance.  Select the Terminate option from the Actions menu.

  • Yes, terminate

    You're one click away now.  Just click the Yes, Terminate button to initiate shutdown.

  • Shutting down

    The Instances screen will show that your EC2 instances are now shutting down.

  • Terminated

    An instance's state will change to terminated once it has been fully shut down and de-provisioned.  From this point on, you won't be billed.

    You can repeat this entire procedure any time you want to do some Hadoop work.  EMR clusters are available on-demand and you can shut them down as soon as you're done using them.

    For a more permanent cloud-based cluster, you can provision raw EC2 instances, installing Hadoop on them yourself and federating them into a cluster.

    Cloud-based, on-demand Hadoop cluster services comparable to AWS EMR will soon be available from Microsoft, Google and Rackspace, leaving fodder for future galleries here on ZDNet.

    • Also read: Hadoop in the Cloud with Amazon, Google and MapR

By Andrew Brust for Big on Data | January 7, 2013 -- 14:30 GMT (06:30 PST) | Topic: Big Data Analytics

Curious how to go about doing Hadoop in Amazon's cloud? Here's some guidance.
