Configure and Install Hadoop on Linux

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop has three main parts:

1 : Hadoop Common : the common utilities that support the other Hadoop subprojects and related projects such as HBase, Hive, Cassandra, Pig, ZooKeeper, etc.
2 : Hadoop Distributed File System (HDFS) : a distributed file system that provides high-throughput access to application data.
3 : Hadoop MapReduce : a software framework for distributed processing of large data sets on compute clusters.

A Hadoop cluster runs five kinds of daemons:

1. NameNode : manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster.
2. SecondaryNameNode : downloads periodic checkpoints from the NameNode for fault tolerance. There is exactly one SecondaryNameNode in each cluster.
3. JobTracker : hands out tasks to the slave nodes. There is exactly one JobTracker in each cluster.
4. DataNode : holds file system data. Each DataNode manages its own locally attached storage (i.e., the node's hard disk) and stores a copy of some or all blocks in the file system. There are one or more DataNodes in each cluster.
5. TaskTracker : slaves that carry out map and reduce tasks. There are one or more TaskTrackers in each cluster.

Installation

Required Software :
1 : Java 1.6.x must be installed on your system, with the JAVA_HOME environment variable pointing to it.
2 : ssh must be installed and sshd must be running, because the Hadoop scripts that manage remote Hadoop daemons use it.
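Before going further it may help to verify the prerequisites. The sketch below is only an illustration — the check_cmd helper is made up for this post, and the JDK path mentioned in the comment is the example path used later in this article:

```shell
#!/bin/sh
# Illustrative prerequisite check; check_cmd is a hypothetical helper,
# not part of Hadoop.
check_cmd() {
    # print "<name>: found" or "<name>: missing" for the given command
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

check_cmd java
check_cmd ssh
check_cmd sshd

# JAVA_HOME should point at the JDK, e.g. /usr/java/jdk1.6.0_18
if [ -n "$JAVA_HOME" ]; then
    echo "JAVA_HOME=$JAVA_HOME"
else
    echo "JAVA_HOME is not set"
fi
```

If any command reports "missing", install it before continuing.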
For the Windows operating system, Cygwin is required for shell support.

--------------------------------------------------------------------------

Download the stable Hadoop release from http://hadoop.apache.org/common/releases.html into your preferred directory; in this example it is downloaded to /usr/local/hadoop-0.20.2.tar.gz. Go to the download directory and untar it with the following commands:

# cd /usr/local/
# tar xvfz hadoop-0.20.2.tar.gz
# cd hadoop-0.20.2

Set the JAVA_HOME path in the conf/hadoop-env.sh file:

export JAVA_HOME=/usr/java/jdk1.6.0_18

Now you are ready to start the Hadoop cluster in one of three modes:

Local (Standalone) Mode
Pseudo-Distributed Mode
Fully-Distributed Mode

Standalone Mode : By default Hadoop is configured to run in non-distributed mode, as a single Java process. This is useful for debugging. You can easily test it as follows:

# mkdir input
# cp conf/*.xml input
# bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
# cat output/*

Pseudo-Distributed Mode : Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process. If you would like to run Hadoop as a dedicated Linux user (for example, hadoop), create the hadoop group and user and give them permission on the directories:

# groupadd hadoop
# useradd -g hadoop hadoop
# passwd hadoop
# mkdir /usr/local/hadoop-data
# chown -R hadoop:hadoop /usr/local/hadoop-data
# chown -R hadoop:hadoop /usr/local/hadoop-0.20.2
# su - hadoop
# cd /usr/local

Step 1 : Edit the file conf/core-site.xml and add the following lines within the configuration element.
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop-data</value>
</property>

where hadoop-data is the directory in /usr/local/ that stores all file system data.

Step 2 : Edit the file conf/hdfs-site.xml and add the following lines within the configuration element.

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Step 3 : Edit the file conf/mapred-site.xml and add the following lines within the configuration element.

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>

Step 4 : Set up passphraseless ssh. If you cannot ssh to localhost without a passphrase, execute the following commands:

# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

or

# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# ssh-copy-id -i ~/.ssh/id_dsa.pub slave-machines

Step 5 : Format the name node:

# bin/hadoop namenode -format

Step 6 : Start the Hadoop daemons:

# bin/start-all.sh

After running the above command, you can view the logs in the HADOOP_HOME/logs directory. The script starts the NameNode and the JobTracker; by default their web interfaces are available at:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

Fully-Distributed Mode : To set up Hadoop in cluster mode we have to identify the machines for the namenode and the datanodes. For example, suppose there are 5 machines, of which one is the namenode and the other 4 are datanodes, with the names and IP addresses given here.

MachineName       IP Address
-------------------------------------------------------
master-namenode   192.168.0.50
datanode-1        192.168.0.51
datanode-2        192.168.0.52
datanode-3        192.168.0.53
datanode-4        192.168.0.54
-------------------------------------------------------

The steps for installing Hadoop in a cluster environment are given below.

Step 1 : Install the prerequisite software on all machines,
i.e. jdk-1.6.x, ssh, etc.

Step 2 : Download Hadoop from the Apache site. Suppose it is downloaded into /usr/local/ and the downloaded version is hadoop-0.20.2.tar.gz. Untar it, then copy the resulting hadoop-0.20.2 directory into /usr/local/ on all machines, including master-namenode:

# cd /usr/local/
# tar xvfz hadoop-0.20.2.tar.gz

Step 3 : Edit /etc/hosts on all machines and add the following entries:

192.168.0.50 master-namenode
192.168.0.51 datanode-1
192.168.0.52 datanode-2
192.168.0.53 datanode-3
192.168.0.54 datanode-4

Step 4 : Create a user named hadoop on all machines and create a directory named hadoop-data in /usr/local/ for Hadoop's data storage, i.e. execute the following commands on all machines:

# groupadd hadoop
# useradd -g hadoop hadoop
# mkdir /usr/local/hadoop-data
# chown -R hadoop:hadoop /usr/local/hadoop-0.20.2
# chown -R hadoop:hadoop /usr/local/hadoop-data

Step 5 : Edit the config file /usr/local/hadoop-0.20.2/conf/masters:

192.168.0.50 (or master-namenode)

Step 6 : Edit the config file /usr/local/hadoop-0.20.2/conf/slaves:

datanode-1
datanode-2
datanode-3
datanode-4

Note : every datanode must be on a separate line.

Step 7 : Edit the config file /usr/local/hadoop-0.20.2/conf/core-site.xml and add the following properties within the configuration element.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop-data</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://master-namenode:9000</value>
</property>

Step 8 : Edit the config file /usr/local/hadoop-0.20.2/conf/hdfs-site.xml and add the following properties within the configuration element.
<property>
  <name>dfs.replication</name>
  <value>4</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>true</value>
</property>
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0</value>
</property>

Step 9 : Edit the config file /usr/local/hadoop-0.20.2/conf/mapred-site.xml and add the following property within the configuration element.

<property>
  <name>mapred.job.tracker</name>
  <value>master-namenode:9001</value>
</property>

Repeat steps 7-9 on all the datanode servers too.

Step 10 : Set up passwordless ssh from master-namenode to datanode-1 ... datanode-4. On the master-namenode machine, type the following commands:

# su - hadoop
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@datanode-1
# ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@datanode-2
# ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@datanode-3
# ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@datanode-4

Step 11 : Check passwordless ssh from master-namenode to all the datanode servers:

# ssh datanode-1
# ssh datanode-2
# ssh datanode-3
# ssh datanode-4

If all of them connect without a password, everything is fine.

Step 12 : Format the namenode, only on the master-namenode machine:

# cd /usr/local/hadoop-0.20.2/
# bin/hadoop namenode -format

Step 13 : Start Hadoop, only on the master-namenode machine; this starts the namenode and jobtracker locally and the datanodes and tasktrackers on the slaves:

# bin/start-all.sh

or

# bin/start-dfs.sh
# bin/start-mapred.sh

Step 14 : After the above command finishes on master-namenode, you can check the web interface for the namenode at http://master-namenode:50070/ and the web interface for the jobtracker at http://master-namenode:50030/. After starting, you can work on the cluster. To stop all the datanodes, tasktrackers, namenode, and jobtracker, execute the following command on the master:

# bin/stop-all.sh

or

# bin/stop-dfs.sh
# bin/stop-mapred.sh
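As a closing tip, the conf/masters and conf/slaves files from Steps 5 and 6 can be generated with a short script. This is a minimal sketch under assumptions: the hostnames are the example cluster's, and CONF_DIR defaults to a scratch directory — point it at /usr/local/hadoop-0.20.2/conf in real use:

```shell
#!/bin/sh
# Sketch: generate conf/masters and conf/slaves for the example cluster.
# CONF_DIR is a placeholder default, not the real Hadoop conf directory.
CONF_DIR=${CONF_DIR:-/tmp/hadoop-conf}
mkdir -p "$CONF_DIR"

# masters names the master host, one per line (as in Step 5)
echo "master-namenode" > "$CONF_DIR/masters"

# slaves must list every datanode hostname, one per line (as in Step 6)
: > "$CONF_DIR/slaves"
for n in 1 2 3 4; do
    echo "datanode-$n" >> "$CONF_DIR/slaves"
done

wc -l < "$CONF_DIR/slaves"
```

Keeping the slaves list scripted makes it easy to regenerate when datanodes are added or removed; remember to copy the updated files to the master before restarting the daemons.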
Thursday, 18 April 2013