Thursday, 18 April 2013

Install and Configure Hadoop on Linux



The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. 

Hadoop has three main parts:
1 : Hadoop Common : The common utilities that support the other Hadoop subprojects and related projects such as HBase, Hive, Cassandra, Pig, and ZooKeeper.
2 : Hadoop Distributed File System (HDFS) : A distributed file system that provides high-throughput access to application data.
3 : Hadoop MapReduce : A software framework for distributed processing of large data sets on compute clusters.
A Hadoop cluster runs five kinds of daemons:
1. NameNode : Manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster.
2. SecondaryNameNode : Downloads periodic checkpoints from the NameNode for fault tolerance. There is exactly one SecondaryNameNode in each cluster.
3. JobTracker : Hands out tasks to the slave nodes. There is exactly one JobTracker in each cluster.
4. DataNode : Holds file system data. Each DataNode manages its own locally attached storage (i.e., the node's hard disk) and stores a copy of some or all blocks in the file system. There are one or more DataNodes in each cluster.
5. TaskTracker : A slave that carries out map and reduce tasks. There are one or more TaskTrackers in each cluster.
                        Installation

Required Software :

1 : Java 1.6.x must be installed on your system, with the JAVA_HOME environment variable pointing to it.
2 : ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

For the Windows operating system, Cygwin is additionally required for shell support.

 --------------------------------------------------------------------------

Download the stable Hadoop release from http://hadoop.apache.org/common/releases.html
and save it into your preferred directory; in this example it is downloaded to /usr/local/hadoop-0.20.2.tar.gz.

Go to the download directory and untar it with the following commands:

# cd /usr/local/
# tar xvfz hadoop-0.20.2.tar.gz
# cd hadoop-0.20.2

Set the JAVA_HOME path in the conf/hadoop-env.sh file:

export JAVA_HOME=/usr/java/jdk1.6.0_18
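If you are unsure where your JDK lives, you can derive the JAVA_HOME value from the path of the java binary. This is only a sketch: the JDK path below is the example value used in this post, not a detected one; adjust it to your machine.

```shell
# Hypothetical JDK install path (the example value from above).
JAVA_BIN=/usr/java/jdk1.6.0_18/bin/java
# JAVA_HOME is two directories above the java binary (strip /bin/java).
JAVA_HOME_GUESS=$(dirname "$(dirname "$JAVA_BIN")")
echo "$JAVA_HOME_GUESS"
```

Put the resulting path into the export line in conf/hadoop-env.sh.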


Now you are ready to start the Hadoop cluster in one of three modes:

 Local (Standalone) Mode
 Pseudo-Distributed Mode
 Fully-Distributed Mode

Standalone Mode : By default, Hadoop is configured to run in non-distributed mode, as a single Java process. This is useful for debugging.

You can easily test standalone mode as follows:

# mkdir input
# cp conf/*.xml input
# bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+' 
# cat output/*
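To see what this example job does, here is a plain grep equivalent of the same pattern, run against a scratch file (illustrative only; the real Hadoop job reads the copied conf/*.xml files and writes its matches into output/):

```shell
# Scratch input standing in for the copied config files (illustrative).
mkdir -p demo-input
echo '<name>dfs.replication</name>' > demo-input/sample.xml
# Same regular expression the Hadoop example job uses.
grep -oE 'dfs[a-z.]+' demo-input/sample.xml
# prints: dfs.replication
```

The MapReduce job distributes exactly this kind of pattern matching across the cluster instead of running it in a single process.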

Pseudo-Distributed Mode : Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

If you would like to run Hadoop as a dedicated Linux user (for example, hadoop),
you have to create the hadoop group and hadoop user and give them permission on the directories:

# groupadd hadoop
# useradd -g hadoop hadoop
# passwd hadoop

# mkdir /usr/local/hadoop-data

# chown -R hadoop:hadoop hadoop-data
# chown -R hadoop:hadoop hadoop-0.20.2

# su - hadoop

# cd /usr/local


Step 1 : Edit the file conf/core-site.xml and add the following lines within the
configuration element.
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
    <property>
     <name>hadoop.tmp.dir</name>
     <value>/usr/local/hadoop-data</value>
    </property>
where hadoop-data is a directory in /usr/local/ that stores all file system data.
Step 2 : Edit the file conf/hdfs-site.xml and add the following lines within the
configuration element.
    <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
Step 3 : Edit the file conf/mapred-site.xml and add the following lines within the
configuration element.
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
Step 4 : Set up passphraseless ssh. If you cannot ssh to localhost without a passphrase, execute the following commands:
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa 
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
or

# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# ssh-copy-id -i ~/.ssh/id_dsa.pub slave-machines
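A quick way to verify the setup is to attempt a connection with BatchMode, which makes ssh fail instead of prompting for a password. This is a sketch assuming sshd is running on localhost:

```shell
# BatchMode=yes disables password prompts, so this either succeeds
# silently or fails -- it never hangs waiting for input.
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true 2>/dev/null; then
  SSH_STATUS="passphraseless ssh OK"
else
  SSH_STATUS="ssh still asks for a password (or sshd is not running)"
fi
echo "$SSH_STATUS"
```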

Step 5 : Execute the following command to format HDFS:
# bin/hadoop namenode -format
Step 6 : Start the hadoop daemons:
# bin/start-all.sh
After running the above command, you can view the logs in the HADOOP_HOME/logs
directory. The script starts the NameNode and the JobTracker; by default their web interfaces are available at http://localhost:50070/ and http://localhost:50030/ respectively.
Fully-Distributed Mode :

To set up Hadoop in cluster mode, we first have to identify the machines for the NameNode and DataNodes. For example,

suppose there are 5 machines, of which one is the NameNode and the other 4
are DataNodes, with the names and IP addresses given here.

 MachineName                     IP Address
 -------------------------------------------------------
 master-namenode                 192.168.0.50
 datanode-1                      192.168.0.51
 datanode-2                      192.168.0.52
 datanode-3                      192.168.0.53
 datanode-4                      192.168.0.54
 -------------------------------------------------------

The steps for installing Hadoop in a cluster environment are given below.

Step 1 : Install the prerequisite software on all machines, i.e.
jdk-1.6.x, ssh, etc.

Step 2 : Download Hadoop from the Apache site. Suppose it is downloaded into
/usr/local/ and the downloaded version is hadoop-0.20.2.tar.gz.

Go to the directory:

# cd /usr/local/

Untar it:

# tar xvfz hadoop-0.20.2.tar.gz

Then copy the hadoop-0.20.2 directory into /usr/local/ of all machines, including master-namenode.

Step 3 : Edit /etc/hosts on all machines and add the following entries:

192.168.0.50    master-namenode
192.168.0.51    datanode-1
192.168.0.52    datanode-2
192.168.0.53    datanode-3
192.168.0.54    datanode-4
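Since the entries follow the pattern in the table above, you can generate the snippet instead of typing it. This sketch writes to a scratch file, which you would then append to /etc/hosts as root on each machine; the hostnames and IPs are the example values from this post.

```shell
# Generate the host entries from the example table into a scratch file.
{
  echo "192.168.0.50 master-namenode"
  for i in 1 2 3 4; do
    echo "192.168.0.5$i datanode-$i"
  done
} > hosts.snippet
cat hosts.snippet
```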


Step 4 : Create a user named hadoop on all machines, and create a directory named

hadoop-data in /usr/local/ for Hadoop's data storage, i.e.

execute the following commands on all machines:

# groupadd hadoop
# useradd -g hadoop hadoop
# mkdir /usr/local/hadoop-data
# chown -R hadoop:hadoop /usr/local/hadoop-0.20.2
# chown -R hadoop:hadoop /usr/local/hadoop-data

Step 5 : Edit the config file /usr/local/hadoop-0.20.2/conf/masters and add either

192.168.0.50
or
master-namenode

Step 6 : Edit the config file /usr/local/hadoop-0.20.2/conf/slaves:

datanode-1 
datanode-2
datanode-3
datanode-4

Note : every DataNode server must be on a separate line.
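The slaves file can likewise be generated rather than hand-edited. This sketch writes to a scratch path; point it at your real conf/slaves when you use it.

```shell
# One DataNode hostname per line, as the note above requires.
SLAVES_FILE=slaves.scratch
: > "$SLAVES_FILE"
for i in 1 2 3 4; do
  echo "datanode-$i" >> "$SLAVES_FILE"
done
cat "$SLAVES_FILE"
```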

Step 7 : Edit the config file /usr/local/hadoop-0.20.2/conf/core-site.xml and add the following properties in the configuration element.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop-data</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://master-namenode:9000</value>
</property>

Step 8 : Edit the config file /usr/local/hadoop-0.20.2/conf/hdfs-site.xml and add the following properties in the configuration element.

<property>
  <name>dfs.replication</name>
  <value>4</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>true</value>
</property>
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0</value>
</property>


Step 9 : Edit the config file /usr/local/hadoop-0.20.2/conf/mapred-site.xml and
add the following lines in the configuration element.

<property>
  <name>mapred.job.tracker</name>
  <value>master-namenode:9001</value>
</property>


Repeat steps 7-9 on all DataNode servers too.
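Rather than repeating the edits by hand on every DataNode, you can push the three edited files from the master with scp. The loop below is a dry run that only prints the commands (hostnames and paths are the example values from this post); remove the echo quoting to actually copy.

```shell
# Dry run: print one scp command per DataNode in the example cluster.
SCP_CMDS=$(for host in datanode-1 datanode-2 datanode-3 datanode-4; do
  echo "scp conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml hadoop@$host:/usr/local/hadoop-0.20.2/conf/"
done)
echo "$SCP_CMDS"
```

This relies on the passwordless ssh set up in the next step, so in practice you would run it after step 10.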

Step 10 : Set up passwordless ssh from master-namenode to datanode-1 ... datanode-4.

On the master-namenode machine, type the following commands:

# su - hadoop
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@datanode-1
# ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@datanode-2
# ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@datanode-3
# ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@datanode-4
Step 11 : Check passwordless ssh from master-namenode to all DataNode servers, datanode-1 ... datanode-4:

# ssh datanode-1
# ssh datanode-2
# ssh datanode-3
# ssh datanode-4

If all connect without a password, then it is fine.

Step 12 : Format the namenode only on master-namenode machine
# cd /usr/local/hadoop-0.20.2/
# bin/hadoop namenode -format
Step 13 : After that, start Hadoop only on the master-namenode machine.
This starts the NameNode and JobTracker on the master, and the DataNode and TaskTracker daemons on the slaves:
# bin/start-all.sh
or
# bin/start-dfs.sh    
# bin/start-mapred.sh

Step 14 : After running the above command on master-namenode, you can check
the web interface for the NameNode at http://master-namenode:50070/ and
the web interface for the JobTracker at http://master-namenode:50030/.
After starting, you can work on the cluster.
To stop all the DataNode, TaskTracker, NameNode, and JobTracker daemons, execute the following command on the master:
# bin/stop-all.sh
or 
# bin/stop-dfs.sh    
# bin/stop-mapred.sh
