Apache Hadoop Deployment

It’s been a while since posting here. I figured I’d start the new year with a new post on what I spent much of the latter part of 2013 researching in my spare time. As a result of my reading on big data and data analytics, I built an Apache Hadoop 2.2.0 cluster on virtual machines in my lab. I chose the vanilla Apache distribution rather than the Hortonworks Sandbox or VMware Serengeti, since the manual installation process was an opportunity to learn the components and internals of the environment. Below is a compilation of tutorials and my own tinkering as I built the environment from scratch. Much of it was gleaned from Michael Noll’s tutorial for deploying Hadoop on Ubuntu, with my own adaptations for CentOS.

Installation Pre-requisites:

I built the whole environment on CentOS 6 (i386) virtual machines. I created a virtual machine template to keep the process of adding new nodes to the cluster simple. I chose the i386 version for two reasons: 1) I was only giving each virtual machine 2GB of RAM, and 2) the pre-compiled Apache Hadoop 2.2.0 distribution ships as 32-bit binaries. Yes, I could have compiled from source for 64-bit, but I was trying to keep it simple.

Next are the preparation steps I took in the CentOS virtual machine build; a short verification sketch follows the list.

  • Pre-configure DNS resolution for all of the nodes that would be added to the cluster
  • Disable IPv6
# edit /etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
  • Edit /etc/hosts and remove the ::1 entry
  • Reboot the server and check if IPv6 is disabled
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
  • Install and configure Java. Download the latest Java installation and copy it to /var/tmp
cd /var/tmp
tar xvfz jdk-7u45-linux-i586.tar.gz -C /opt

# edit /root/.bashrc
export JAVA_HOME=/opt/jdk1.7.0_45
export JRE_HOME=/opt/jdk1.7.0_45/jre
export PATH=$PATH:/opt/jdk1.7.0_45/bin:/opt/jdk1.7.0_45/jre/bin

# set the path link:
# alternatives --install [link] [name] [path] [priority]
[root@hadoop01 bin]# alternatives --install /usr/bin/java java /opt/jdk1.7.0_45/bin/java 2
[root@hadoop01 bin]# alternatives --config java
There is 1 program that provides 'java'.
  Selection    Command
-----------------------------------------------
 *+ 1           /opt/jdk1.7.0_45/bin/java
 Enter to keep the current selection[+], or type selection number: 1

# check the Java version: java -version
[root@hadoop01 ~]# java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) Client VM (build 24.45-b08, mixed mode)
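
Before moving on, a quick sanity check of the template is worthwhile. The following is a minimal sketch; the hostnames are examples from my lab (nycstorm.lab), so adjust the list to the nodes you plan to deploy:

# verify DNS resolution for the planned nodes, IPv6 status, and Java
for h in hadoop01 hadoop02 hadoop03; do
  getent hosts $h.nycstorm.lab || echo "DNS lookup failed for $h"
done
cat /proc/sys/net/ipv6/conf/all/disable_ipv6   # should print 1
java -version                                  # should report 1.7.0_45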

Hadoop Configuration and Setup:

Now it’s time to actually install and configure the Hadoop components:

  • Create a Hadoop user account and group as root
groupadd hadoop
useradd -g hadoop hadoop
id hadoop
  • Add the hadoop user to /etc/sudoers
hadoop    ALL=(ALL)       ALL
  • Configure key-based login via SSH as the hadoop user
su - hadoop
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

[hadoop@hadoop01 ~]$ ssh-keygen -t rsa 
Generating public/private rsa key pair. 
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
Created directory '/home/hadoop/.ssh'. 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa. 
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub. 
The key fingerprint is: c6:97:3a:39:a7:66:8f:9a:9b:bc:f1:0a:8c:29:b4:63 hadoop@hadoop01.nycstorm.lab 
The key's randomart image is: 
+--[ RSA 2048]----+ 
|                 | 
|                 | 
|                 | 
|       .   .     | 
| .      S o      | 
|. .+   . +       | 
|.Eo o . = .      | 
|...  o =o*       | 
|      OB+..      | 
+-----------------+ 
[hadoop@hadoop01 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
[hadoop@hadoop01 ~]$ chmod 0600 ~/.ssh/authorized_keys
  • Update the ~/.bashrc script with all the necessary environment variables for the hadoop user:
export JAVA_HOME=/opt/jdk1.7.0_45
export JRE_HOME=/opt/jdk1.7.0_45/jre
export PATH=$PATH:/opt/jdk1.7.0_45/bin:/opt/jdk1.7.0_45/jre/bin
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
  • Optionally, if IPv6 has not been disabled, also export the following setting (it isn’t needed here since we disabled IPv6 above):
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
  • Download the Hadoop files
yum install wget
cd /var/tmp
wget http://apache.mesi.com.ar/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
tar xvfz hadoop-2.2.0.tar.gz -C /opt
cd /opt
chown -R hadoop:hadoop hadoop-2.2.0
ln -s hadoop-2.2.0 hadoop
  • Update the Hadoop configuration files in the /opt/hadoop/etc/hadoop directory; a quick sanity check of these settings follows this list
# core-site.xml
   <property>
     <name>fs.default.name</name>
     <value>hdfs://hadoop01:9000</value>
   </property>

# hdfs-site.xml 
# dfs.replication is the number of replicas of each block 
# dfs.name.dir is the path on the local fs where namenode stores the namespace and transactions persistently 
# dfs.data.dir is the comma-separated list of paths on the local fs of the datanode where it stores its blocks
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///opt/hadoop/data/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///opt/hadoop/data/dfs/data</value>
  </property>

# run the following command to copy from the template to the actual file that we will edit
cp mapred-site.xml.template mapred-site.xml

# mapred-site.xml 
# mapred.job.tracker is the jobtracker host and port
# mapred.system.dir is where MapReduce stores system/control files
# mapred.local.dir is where MapReduce stores temp/intermediate files
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop01:9001</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/opt/hadoop/data/mapred/system/</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/opt/hadoop/data/mapred/local/</value>
  </property>

# yarn-site.xml
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop01</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop01:8032</value>
  </property>

# slaves
# delete localhost and add the names of all worker (slave) nodes, one per line; for now just:
hadoop01
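
Once these files are in place, it’s worth confirming that Hadoop actually picks up the values. A minimal sanity check, run as the hadoop user with the ~/.bashrc variables above loaded (if your build doesn’t have hdfs getconf, grepping the XML files works just as well):

# confirm Hadoop sees the configured values
hdfs getconf -confKey fs.default.name    # should print hdfs://hadoop01:9000
hdfs getconf -confKey dfs.replication    # should print 2
hdfs getconf -confKey dfs.name.dir       # should print file:///opt/hadoop/data/dfs/name
cat /opt/hadoop/etc/hadoop/slaves        # should list only hadoop01 for now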

Creating the clone:

At this point, the node is ready to be turned into a virtual machine template for cloning. To do so, we first need to prepare the OS to be cloned. Perform the following steps, which were adapted from the lonesysadmin site, then shut the virtual machine down.

/usr/bin/yum clean all
/bin/cat /dev/null > /var/log/audit/audit.log
/bin/cat /dev/null > /var/log/wtmp
/bin/rm -f /etc/udev/rules.d/70*
/bin/sed -i '/^\(HWADDR\|UUID\)=/d' /etc/sysconfig/network-scripts/ifcfg-eth0
/bin/rm -f /etc/ssh/*key*
/bin/rm -f /home/hadoop/.ssh/*
/bin/rm -Rf /tmp/*
/bin/rm -Rf /var/tmp/*
/bin/rm -f ~root/.bash_history
unset HISTFILE

However, each time a new clone/node is added to the cluster, a few settings need to be updated. I’ve created a script to initialize a newly cloned node. The script should be created in the hadoop user’s home directory and run as the hadoop user with the following syntax: ~/initialize.sh <fqdn> <ipaddress>

#!/bin/bash
# initialize.sh -- run on a newly cloned node as the hadoop user
HOSTNAME=$1
IPADDR=$2
if [ -z "$HOSTNAME" ] || [ -z "$IPADDR" ]
then
  echo "usage: initialize.sh hostname ipaddr"
  exit 1
fi
# replace the template's hostname and IP address with the new node's values
sudo /bin/sed -i "s/HOSTNAME=hadoop.nycstorm.lab/HOSTNAME=$HOSTNAME/" /etc/sysconfig/network
sudo /bin/sed -i "s/IPADDR=192.168.11.49/IPADDR=$IPADDR/" /etc/sysconfig/network-scripts/ifcfg-eth0
grep HOSTNAME /etc/sysconfig/network
grep IPADDR /etc/sysconfig/network-scripts/ifcfg-eth0
echo
echo "######################################################################"
echo "# NOTICE: $HOSTNAME needs a reboot now for hostname -f to take effect."
echo "#"
echo "# SERVICE: hadoop-daemon.sh start datanode"
echo "# SERVICE: yarn-daemon.sh start nodemanager"
echo "#"
echo "######################################################################"
echo
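
For example, to bring a freshly cloned node online as hadoop02 (the hostname and IP address below are illustrative values from my lab; substitute your own):

# on the new clone, as the hadoop user
chmod +x ~/initialize.sh
~/initialize.sh hadoop02.nycstorm.lab 192.168.11.51
sudo reboot

After the reboot, start the datanode and nodemanager services as the banner suggests.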

A few updates also need to be made on the master node each time a node is added. I’ve created the following script to handle that:

#!/bin/bash
ADDNODE=$1
if [ -z "$ADDNODE" ]
then
  echo usage: addnode.sh hostname
  exit 1
fi
echo $ADDNODE >> /opt/hadoop/etc/hadoop/slaves
scp ~/.ssh/id_rsa.pub hadoop@$ADDNODE:/home/hadoop/id_rsa.pub.hadoop01
ssh hadoop@$ADDNODE "cat id_rsa.pub.hadoop01 >> .ssh/authorized_keys; chmod 644 .ssh/authorized_keys"
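
A minimal usage sketch, assuming the script above is saved as ~/addnode.sh on the master and the new node has already been initialized and rebooted (hadoop02.nycstorm.lab here is just an example node name):

# on hadoop01 as the hadoop user
chmod +x ~/addnode.sh
~/addnode.sh hadoop02.nycstorm.lab
# key-based login to the new node should now work without a password prompt
ssh hadoop@hadoop02.nycstorm.lab hostname -f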

Starting Services:

The following are a few commands for starting overall cluster services and for checking status:

# format the hdfs filesystem
/opt/hadoop/bin/hdfs namenode -format
# start the hdfs services
/opt/hadoop/sbin/start-dfs.sh
# start the YARN services (resourcemanager and nodemanagers)
/opt/hadoop/sbin/start-yarn.sh
# check the status of all services
jps
# if everything started properly, jps will show something like the following:
 2583 DataNode
 2970 ResourceManager
 3461 Jps
 3177 NodeManager
 2361 NameNode
 2840 SecondaryNameNode

Note that on each individual data node, you can start the datanode and nodemanager services via the following commands:

hadoop-daemon.sh start datanode
yarn-daemon.sh start nodemanager
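
For completeness, the matching stop commands (these scripts ship with Hadoop 2.2.0 alongside the start scripts used above):

# on an individual data node
hadoop-daemon.sh stop datanode
yarn-daemon.sh stop nodemanager
# on the master node, to stop the whole cluster
/opt/hadoop/sbin/stop-yarn.sh
/opt/hadoop/sbin/stop-dfs.sh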

From the master node, you can check the status of all nodes:

[hadoop@hadoop01 ~]$ yarn node -list
13/11/18 13:07:44 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.11.50:8032
Total Nodes:10
Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
hadoop04.nycstorm.lab:52906     RUNNING hadoop04.nycstorm.lab:8042     0
hadoop03.nycstorm.lab:44443     RUNNING hadoop03.nycstorm.lab:8042     0
hadoop08.nycstorm.lab:42321     RUNNING hadoop08.nycstorm.lab:8042     0
hadoop10.nycstorm.lab:53675     RUNNING hadoop10.nycstorm.lab:8042     0
hadoop07.nycstorm.lab:33923     RUNNING hadoop07.nycstorm.lab:8042     0
hadoop01.nycstorm.lab:48101     RUNNING hadoop01.nycstorm.lab:8042     0
hadoop02.nycstorm.lab:60853     RUNNING hadoop02.nycstorm.lab:8042     0
hadoop05.nycstorm.lab:39854     RUNNING hadoop05.nycstorm.lab:8042     0
hadoop09.nycstorm.lab:45020     RUNNING hadoop09.nycstorm.lab:8042     0
hadoop06.nycstorm.lab:35679     RUNNING hadoop06.nycstorm.lab:8042     0

You can then upload some files and take a look at the status of your HDFS filesystem.

[hadoop@hadoop01 data]$ hdfs dfs -ls /data
Found 1 items
-rw-r--r-- 2 hadoop supergroup 284806 2013-11-26 13:02 /data/pg16_peter_pan.txt
[hadoop@hadoop01 data]$ hdfs dfs -copyFromLocal * /data
copyFromLocal: `/data/pg16_peter_pan.txt': File exists
[hadoop@hadoop01 data]$ hdfs dfs -ls /data
Found 14 items
-rw-r--r-- 2 hadoop supergroup 167517 2013-11-26 15:25 /data/pg11_alice_in_wonderland.txt
-rw-r--r-- 2 hadoop supergroup 3322651 2013-11-26 15:25 /data/pg135_les_miserables.txt
-rw-r--r-- 2 hadoop supergroup 284806 2013-11-26 13:02 /data/pg16_peter_pan.txt
-rw-r--r-- 2 hadoop supergroup 1257274 2013-11-26 15:25 /data/pg2701_moby_dick.txt
-rw-r--r-- 2 hadoop supergroup 90701 2013-11-26 15:25 /data/pg41_sleepy_hollow.txt
-rw-r--r-- 2 hadoop supergroup 1573150 2013-11-26 15:25 /data/pg4300_ulysses.txt
-rw-r--r-- 2 hadoop supergroup 181997 2013-11-26 15:25 /data/pg46_a_christmas_carol.txt
-rw-r--r-- 2 hadoop supergroup 1423803 2013-11-26 15:25 /data/pg5000_notes_of_leonardo_davinci.txt
-rw-r--r-- 2 hadoop supergroup 141419 2013-11-26 15:25 /data/pg5200_metamorphosis.txt
-rw-r--r-- 2 hadoop supergroup 421884 2013-11-26 15:25 /data/pg74_adventures_of_tom_sawyer.txt
-rw-r--r-- 2 hadoop supergroup 610157 2013-11-26 15:25 /data/pg76_adventures_of_huckleberry_finn.txt
-rw-r--r-- 2 hadoop supergroup 142382 2013-11-26 15:25 /data/pg844_the_importance_of_being_earnest.txt
-rw-r--r-- 2 hadoop supergroup 448689 2013-11-26 15:25 /data/pg84_frankenstein.txt
-rw-r--r-- 2 hadoop supergroup 641414 2013-11-26 15:25 /data/pg8800_the_divine_comedy.txt
[hadoop@hadoop01 data]$
[hadoop@hadoop01 data]$ hdfs dfsadmin -report
Configured Capacity: 211378749440 (196.86 GB)
Present Capacity: 195689984139 (182.25 GB)
DFS Remaining: 195668267008 (182.23 GB)
DFS Used: 21717131 (20.71 MB)
DFS Used%: 0.01%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

Running on Hadoop

I then wrote mapper.pl and reducer.pl scripts to do the simple word-count example and ran them against the files I had uploaded to HDFS. With those Perl scripts, I used the Hadoop Streaming API to submit the job.

hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
-mapper /home/hadoop/src/mapper.pl \
-reducer /home/hadoop/src/reducer.pl \
-input /data/pg16_peter_pan.txt -output /output/pg16_peter_pan
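
Once the job completes, the results are written to the HDFS directory given with -output. A quick way to inspect them (a sketch; streaming reduce output lands in part-NNNNN files, typically part-00000 with the default single reducer):

# list the job output and sample the word counts
hdfs dfs -ls /output/pg16_peter_pan
hdfs dfs -cat /output/pg16_peter_pan/part-00000 | head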

Below is the Perl code I used for the mapper.pl and reducer.pl scripts; it can easily be adapted to Python or anything else. You can also test the logic locally by running: cat <dir_with_all_files>/* | ./mapper.pl | ./reducer.pl

[hadoop@hadoop01 src]$ cat mapper.pl
#!/usr/bin/perl
# mapper: emit "<word>\t1" for every word read from stdin
mapper();
sub mapper {
  foreach my $line (<STDIN>) {
    chomp($line);
    $line =~ s/[.,;:?!"()\[\]]//g;   # strip punctuation
    $line =~ s/--/ /g;
    my @words = split(/\s+/, $line);
    foreach my $word (@words) {
      print "$word\t1\n";
    }
  }
}
[hadoop@hadoop01 src]$ cat reducer.pl
#!/usr/bin/perl
# reducer: sum the per-word counts emitted by the mapper
reducer();
sub reducer {
  my %hash;
  foreach my $line (<STDIN>) {
    chomp($line);
    my ($key,$value) = split(/\t/, $line);
    if (defined $hash{$key}) {
      $hash{$key} += $value;
    } else {
      $hash{$key} = $value;
    }
  }
  foreach my $key (keys %hash) {
    print "$key\t$hash{$key}\n";
  }
}
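
As a concrete local test of the two scripts (a sketch that assumes the downloaded Gutenberg text files also live in /home/hadoop/data on the local filesystem; adjust the path to wherever you keep them):

# run the word count entirely locally and show the ten most frequent words
cd /home/hadoop/src
chmod +x mapper.pl reducer.pl
cat /home/hadoop/data/*.txt | ./mapper.pl | ./reducer.pl | sort -t$'\t' -k2 -nr | head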

Status

There are a few web consoles for checking the status of your Hadoop cluster:

  • NameNode web UI: http://hadoop01:50070
  • ResourceManager web UI: http://hadoop01:8088
  • NodeManager web UI on each node: http://<node>:8042
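
If you want a quick check from the shell that the consoles are answering, the following is a minimal sketch; it assumes the default Hadoop 2.2.0 web UI ports listed above and that curl is installed (yum install curl if not):

# a 200 response code means the web UI is up and answering
curl -s -L -o /dev/null -w "%{http_code}\n" http://hadoop01:50070
curl -s -L -o /dev/null -w "%{http_code}\n" http://hadoop01:8088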

Troubleshooting

In the process of building the cluster, I ran into a number of issues. Below are some of those issues and how I resolved them.

The first issue I ran into was connectivity between the master node and the data nodes.

2013-11-15 17:24:18,463 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: hadoop01/192.168.11.50:9000
2013-11-15 17:24:24,465 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop01/192.168.11.50:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

The cause was the iptables firewall on the master node blocking access from the data nodes. I didn’t spend the time to add the proper firewall rules; instead I chose to either disable iptables or flush all the rules. Not secure, but again, this exercise was about learning Hadoop, not deploying a production cluster.

iptables --list
iptables --flush            # deletes all rules
/etc/init.d/iptables stop   # stops the iptables service
chkconfig iptables off      # disables iptables from starting on boot
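
If you would rather leave iptables running, here is a sketch of the kind of rules that could be added instead. The ports below are the values configured earlier in this post plus common Hadoop 2.2.0 defaults (NameNode/DataNode, ResourceManager/NodeManager and their web UIs), and 192.168.11.0/24 is my lab subnet, so treat this as a starting point rather than a complete list:

# allow the lab subnet to reach the main Hadoop ports, then persist the rules
for port in 9000 50070 50010 50020 50075 8030 8031 8032 8033 8088 8040 8042; do
  iptables -I INPUT -p tcp -s 192.168.11.0/24 --dport $port -j ACCEPT
done
service iptables save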

I also had issues writing to HDFS due to SELinux. Unfortunately I didn’t capture the log entry for that error, but it was an HDFS write failure. I ran the following to rectify the issue.

sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# put SELinux into permissive mode immediately without a reboot (alternatively, just reboot)
setenforce 0

As part of troubleshooting the HDFS write issues above, I ended up reformatting HDFS, which caused its own problems. If you ever reformat HDFS, you also need to delete the dfs.data.dir directories on the data nodes; otherwise you will see incompatible clusterIDs errors like the following in the datanode logs (a short recovery sketch follows the log excerpt):

0831241-192.168.11.50-50010-1385486041683) service to hadoop01/192.168.11.50:9000
java.io.IOException: Incompatible clusterIDs in /opt/hadoop-2.2.0/data/dfs/data: namenode clusterID = CID-8b249a29-681f-4417-a464-a849d3a9cc9c; datanode clusterID = CID-472489cb-19b3-4381-8572-c9bf7bf5db64
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:744)
2013-11-26 12:19:36,471 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1806045400-192.168.11.50-1385486221792 (storage id DS-1520831241-192.168.11.50-50010-1385486041683) service to hadoop01/192.168.11.50:9000
2013-11-26 12:19:36,584 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1806045400-192.168.11.50-1385486221792 (storage id DS-1520831241-192.168.11.50-50010-1385486041683)
2013-11-26 12:19:38,584 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2013-11-26 12:19:38,586 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2013-11-26 12:19:38,596 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at hadoop01.nycstorm.lab/192.168.11.50
************************************************************/
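
In practice the fix is to wipe the stale datanode storage so each data node re-registers with the new clusterID on its next start. A minimal sketch using the dfs.data.dir path configured earlier (this destroys any data already in HDFS, which was fine for my lab):

# on the master: stop HDFS
/opt/hadoop/sbin/stop-dfs.sh
# on every data node: remove the block storage left over from the old namespace
rm -rf /opt/hadoop/data/dfs/data
# on the master: start HDFS again; the data nodes recreate their storage directories
/opt/hadoop/sbin/start-dfs.sh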

That’s all. At this point you should have a running Hadoop cluster and have run a job against it.
