Setting up Hadoop Cluster on Cloud

This blog focuses on setting up a Hadoop cluster on the cloud. Before we start with the configuration, we need a Linux platform in the cloud. We will set up our pseudo-distributed (single-node) Hadoop cluster on an AWS EC2 instance.

Note: Here we are assuming that you have an active AWS account and that your Linux instance is running. Also, make sure you have the public and private key pair for that instance.

Connecting to Linux Instance from Windows Using PuTTY

If you are using Windows, you can connect to your instance using PuTTY. After you launch your instance, you can connect to it and use it much as you would a desktop computer.

Before you connect to your instance through PuTTY, you need to complete the following prerequisites:

Step 1: Install PuTTY for Windows

Wondering what PuTTY is? PuTTY is open-source software, available along with its source code. It is an SSH and Telnet client, developed originally by Simon Tatham for the Windows platform. You can download PuTTY by visiting this link.

Putty Download

For more information, you can also visit homepage of PuTTY.

Step 2: Generate a PuTTY Private Key (.ppk)

PuTTY does not support the AWS private key format (.pem) generated by Amazon EC2. To connect to your instance with PuTTY, you need a PuTTY-format key (.ppk). For this, PuTTY provides a tool called PuTTYgen, which converts the AWS key pair (.pem) into a PuTTY-formatted key pair (.ppk).

Here are the steps to generate a PuTTY-formatted key pair (.ppk):

(a) Download PuTTYgen. You can download PuTTYgen from this link.

Putty Generator

(b) Launch the PuTTYgen tool and locate your Amazon-formatted public and private key pair by pressing Load.

Putty key load

(c) You will see a window similar to the one below when you load your .pem key.

putty key load

(d) Click on Save private key and save it on your Desktop.

Save ppk Key

Your private key is now in the correct format and can be used with PuTTY. You can now connect to your instance using PuTTY’s SSH client.


Step 3: Start Your PuTTY Session

Start PuTTY and you will see a window like the one shown below.

Putty Configuration

Step 4: Enter the Host Name (or IP address) of your instance

In the Session category, type the public DNS name or IP address of your instance in the Host Name field. Then, in the Category panel, expand Connection, expand SSH, select Auth, and follow the instructions below:

  1. Click Browse.
  2. Locate your PuTTY private key (.ppk).
  3. Click Open.

If you want to start your session later, you can also save your session.

Step 5: Provide permission.

The first time you connect, PuTTY will ask whether you trust the host; click Yes. When it prompts for a login name, type ec2-user and press Enter.

Your session has now started successfully and you can use your instance. You can begin the single-node Hadoop installation and configuration.

ec2-user login

You can also refer to the AWS documentation if you are facing any problem related to your Amazon EC2 instance.


Step-by-Step Hadoop Configuration

Before proceeding, let’s look at the prerequisites.

  1.  Java Package
  2.  Hadoop Package

Here’s the step-by-step tutorial:

1. Use the below link to download the JDK on your Windows machine using a browser.

2. On clicking the above link, a screen will prompt you to select the required version. Select the option marked with the red arrow.

Java Download

On clicking the above option, the download will start and the file will be saved in the Downloads folder.

3. Download the Hadoop file using the following link:

On clicking the above link, the below screen will prompt you to select a file.

hadoop download

4. The next step is to connect to your instance. The steps are as follows.

Open your instance and log in as ec2-user.

ec2-user login

5. Add a new user to install Hadoop. You need root access to add a new user, so log in as root.

sudo root login

Now that you have root access, you can easily add a new user using the following command.

useradd acadgild

Next, provide a password to the newly created user, using the following command.

passwd acadgild

Make the acadgild user a sudo user by adding a new entry to the sudoers file (via visudo) below the "Allow root to run any commands anywhere" line:

acadgild ALL=(ALL) ALL

See the image below for reference.


Now, get back to your ec2-user from root by typing the exit command.

Next, log in to the acadgild user using the below command.

su - acadgild

Then, enter the password which you provided above.

acadgild login

6. Now we need Java to install Hadoop. You could install Java directly from the yum repository, but here we are going to copy the zip files of Java and Hadoop from the Windows machine, as they have already been downloaded. For this, you need the WinSCP tool to copy files from the Windows machine to your instance. You could use any file transfer tool, such as FileZilla; here I am going to use WinSCP because I don't want to configure an FTP server and its services.

You can download the WinSCP tool from this link.


Follow the below instructions to copy your file through WinSCP.

a. Launch WinSCP.

winscp login

b. Enter the host name and user name, and make sure the port number is 22.

Note: Leave the password field blank, as we are going to login via .ppk file.

c. Click on the Advanced option to import your .ppk file.

advance login

d. Expand the SSH category and click Authentication. You will see a window as shown below. Browse to your PuTTY-formatted private key and locate the .ppk file.


e. Now log in to your instance and copy the files from your PC to the Amazon instance. Here, I have already uploaded the Hadoop and Java zip files to the instance.


WinSCP Upload

Once these files are uploaded to the instance, return to your PuTTY session and type ls to check whether the files are available.


Copy these files from the ec2-user's home directory to the acadgild user's home directory:

cp jdk* /home/acadgild
cp hadoop* /home/acadgild

Now, log in to the acadgild user and extract these files.


Now, we will configure the Hadoop properties:

  • Update the .bashrc file with the required environment variables, including Java and Hadoop path.

Type the command sudo vi .bashrc from the home directory /home/acadgild.


Note: Update the path present in your system.
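For reference, the entries added to .bashrc typically look like the lines below; the paths and the Hadoop directory name are assumptions based on this post's examples, so substitute your own:

```shell
# Example ~/.bashrc additions -- adjust the paths to wherever you
# extracted Java and Hadoop on your own system.
export JAVA_HOME=/home/acadgild/jdk1.8.0_65
export HADOOP_HOME=/home/acadgild/hadoop-2.7.1
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```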

  • Type the command source .bashrc to make the environment variables take effect.

source bashrc

Note: The Java path set in .bashrc will vary for every system; you must give the path of Java where it has been downloaded and extracted, i.e., /path-to-extracted-java-folder.

Example: /home/acadgild/jdk1.8.0_65

  • Create two directories to store NameNode metadata and DataNode blocks as shown below:

Next, change the permissions of the directories to 755.
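As a sketch, the two steps above look like this (the directory names and their location under the acadgild user's home are illustrative):

```shell
# Create the NameNode metadata and DataNode block directories,
# then set their permissions to 755.
mkdir -p ~/hdfs/namenode ~/hdfs/datanode
chmod -R 755 ~/hdfs
```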

  • Change the directory to the location where Hadoop is installed.

cd hadoop

  • Open hadoop-env.sh and add the Java home (path) and Hadoop home (path) in it.

Note: Update the Java version and path to match your system; in our case the version is 1.8 and the location is /usr/lib/jvm/jdk1.8.0_65.
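In practice this means adding a JAVA_HOME export to hadoop-env.sh; the line below uses the example path from this post, so adjust it to your own system:

```shell
# Line added to etc/hadoop/hadoop-env.sh so the Hadoop daemons find
# Java regardless of the login shell's environment. Path is an example.
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_65
```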


  • Open core-site.xml using the below command, from the path shown in the screenshot.

vi core-site.xml

Add the below properties between the configuration tags of core-site.xml.
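For a pseudo-distributed setup, the property is typically the default filesystem URI pointing at localhost. This is a sketch of the standard Hadoop 2.x setting, not a verbatim copy of the screenshot:

```xml
<!-- core-site.xml: HDFS as the default filesystem on this node. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```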



  • Open hdfs-site.xml and add the following lines between the configuration tags.
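Typical hdfs-site.xml properties for a single node are a replication factor of 1 plus the two directories created earlier; the paths below are assumptions matching this post's examples:

```xml
<!-- hdfs-site.xml: single-node replication and storage directories. -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///home/acadgild/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///home/acadgild/hdfs/datanode</value>
</property>
```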


  • Open yarn-site.xml and add the following lines between the configuration tags.
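The usual yarn-site.xml entry for a pseudo-distributed cluster enables the MapReduce shuffle service on the NodeManager (a sketch of the standard Hadoop 2.x setting):

```xml
<!-- yarn-site.xml: enable the MapReduce shuffle auxiliary service. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```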


  • Copy the mapred-site.xml template into mapred-site.xml:

    cp mapred-site.xml.template mapred-site.xml


Then, open mapred-site.xml and add the property below:

vi mapred-site.xml
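The property in question is typically the one that tells MapReduce to run on YARN (a sketch of the standard Hadoop 2.x setting, not a verbatim copy of the screenshot):

```xml
<!-- mapred-site.xml: run MapReduce jobs on the YARN framework. -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```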

  • Generate an SSH key for the Hadoop user.

You can refer to the below screenshot for this.

ssh-keygen -t rsa

Note: Press Enter after typing the command ssh-keygen -t rsa, and press Enter again when it asks for the file in which to save the key and for a passphrase (accepting the defaults).

  • Copy the public key from the .ssh directory into the authorized_keys file.

Change the directory to .ssh and then type the below command to append the public key to the authorized_keys file. Then type ls to check whether the authorized_keys file has been created.

cat id_rsa.pub >> authorized_keys
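The key-generation steps above can also be scripted non-interactively, as in the sketch below; the -f and -N flags simply pre-answer the prompts described above:

```shell
# Create the .ssh directory if needed, generate an RSA key pair
# without prompts (skipping if one already exists), and append the
# public key to authorized_keys for passwordless ssh to localhost.
mkdir -p ~/.ssh
test -f ~/.ssh/id_rsa || ssh-keygen -t rsa -f ~/.ssh/id_rsa -N '' -q
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```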

  • Change the permissions of the .ssh directory to 700 and of the authorized_keys file to 600.

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

  • Restart the SSH service by typing the below command.

sudo service sshd restart

  • Format the NameNode:

hadoop namenode -format

Change the directory to the location of Hadoop.


Note: Change the directory to sbin of Hadoop before starting the daemon.

 To start all the daemons, follow the below steps:

  • Starting NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer

Type the below command to start the NameNode.

./hadoop-daemon.sh start namenode

  • Next, start the DataNode using the below command.

./hadoop-daemon.sh start datanode

  • Now, start the ResourceManager using the following command.

./yarn-daemon.sh start resourcemanager

  • Next, start the NodeManager.

./yarn-daemon.sh start nodemanager
  • Start the JobHistoryServer.

./mr-jobhistory-daemon.sh start historyserver

  • Type the jps command to see the running daemons:


Here, we can see that all the daemons are running, which means we have configured a pseudo-distributed Hadoop cluster on an AWS instance.

