VSPEX Blue Fundamentals

Basic Definitions
Converged Infrastructure (CI) – delivery and consumption of compute, storage, and connectivity as a complete system
Platform 2 applications – traditional, legacy applications
Platform 3 applications – cloud or scale out applications

Cloud Landscape
1st architecture: Integrated Reference Architecture (IRA) – EMC VSPEX offerings
2nd architecture: Integrated Infrastructure System (IIS) – VCE Vblock
3rd architecture: Hyper-Converged Infrastructure (HCI) – Nutanix, Simplivity, VSPEX Blue
4th architecture: Rackscale/Hyper-Rackscale (RS/HRS) – consumption of a complete rack of converged infrastructure via resource pools
5th architecture: Common Modular Building Blocks – hybrid of IIS and HCI so that storage can scale in a non-linear fashion

EMC CI in 2015
Vblock and VSPEX for Platform 2
VSPEX Blue for ROBO + VDI + SMB
Product pending announcement at EMC World based on the 5th architecture

Hyper-converged Infrastructure Appliance (HCIA)
Integration of compute, storage, and virtualization into a commodity off-the-shelf architecture
IDC expects significant growth in this market due to changing consumption models by consumers

VSPEX Blue
EMC’s value claims – 15 minutes from power-on to VM deployment, simple management, linear scalability, single point of support
Powered by VMware EVO:RAIL
Four configurations – 2 standard, 2 performance
Software integrations
RecoverPoint for VMs (for replication)
VMware Data Protection Advanced (for backup; Avamar and Data Domain under the covers)
EMC CloudArray gateway (for access to cloud storage)
ESRS (for integration with EMC support)
Scales from one to four 2U/4-node appliances

Competition
Nutanix – current market leader
Simplivity
Both have strong partner networks in most geographies

Component 1: EMC VSPEX Blue Hardware
Per standard node specifications:
Processor – dual Intel E5-2620 v2 (Ivy Bridge, 12 cores total, 2.1 GHz)
Memory – 128GB
Storage – 32GB SLC SATADOM, 400GB eMLC 2.5″, 3x 1.2TB 10k HDD
Network – 2x 10GbE SFP+ or 10GBase-T RJ-45, 1GbE (RMM)

Per performance node specifications (changes only):
Memory – 192GB

4 appliances – 16 server nodes
3 appliances – 12 server nodes
2 appliances – 8 server nodes
1 appliance – 4 server nodes

Each appliance can support about 100 virtual machines or 250 Horizon View virtual desktops
Appliances are deployed whole (server nodes will always be a multiple of 4)
Each appliance requires 8x 10GbE network ports
All 4 nodes in the appliance must be connected to the 10GbE switch
2x 10GbE ports per node, hence 8x 10GbE for all 4 nodes
Configuring a 10GbE top-of-rack switch is recommended
Enable multicast, IGMP snooping, and IGMP querier
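
For reference, here is a hypothetical Cisco NX-OS-style sketch of those multicast settings (the VLAN number and querier address are made up; check your switch vendor's documentation for the exact syntax):

! IGMP snooping is typically enabled by default; the querier must be enabled
! per VLAN when no multicast router is present on that VLAN
ip igmp snooping
vlan configuration 100
  ip igmp snooping querier 192.168.100.1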

Component 2: VMware EVO:RAIL Software
Browser-agnostic HTML5 GUI
Configures vCenter Server
Configures ESXi hosts and VSAN for all nodes
Configures the network per user input
Default provisioning policies based on virtual machine size and security profile
Link to the vCenter Web Client is provided in the EVO:RAIL GUI as well
Automatically discovers each appliance and its nodes on the network and requires a single click to confirm/add to an existing cluster
Patches and software updates with no workload downtime via systematic vMotions

Component 3: EMC Software
VSPEX Blue Manager
EMC 24×7 support
EMC RecoverPoint
VDPA and Data Domain
EMC CloudArray


CPUID Flags with Windows 2012 Virtual Machines

I was recently working with a customer who was trying to deploy Windows 2012 R2 virtual machines on a Cisco UCS cluster running ESXi 5.1.1. However, the customer was getting the following error:

“Your PC needs to restart. Please hold down the power button. Error Code: 0x000000C4”

[Screenshot: Windows Server 2012 R2 error 0x000000C4]

I went through the process myself and didn’t encounter any issues with the default settings. After running through the process with the customer, I found that he was changing the CPUID Mask from “Expose the NX/XD flag to guest” to “Hide the NX/XD flag from guest” in order to increase vMotion compatibility to other non-UCS ESXi hosts.

[Screenshot: vSphere CPUID Mask setting for the NX/XD flag]

After resetting the flag to “Expose the NX/XD flag to guest”, the error went away, and the Windows 2012 R2 guest OS was able to boot without issue. This makes sense: Windows 8 and Windows Server 2012 require a processor with NX/XD (Data Execution Prevention) support, so hiding the flag from the guest prevents the OS from booting.

Apache Hadoop Deployment

It’s been a while since posting here. I figured I’d start the new year with a new post on what I spent much of the latter part of 2013 researching in my spare time. As a result of my reading on big data and data analytics, I built an Apache Hadoop 2.2.0 cluster in my lab on virtual machines. I chose to go with the vanilla Apache distribution rather than Hortonworks Sandbox or VMware Serengeti as I took the manual process of installation as an opportunity to learn the components and internals of the environment. Below is a compilation of tutorials and my own tinkering as I built the environment from scratch. A lot of this tutorial was gleaned from Michael Noll’s tutorial for deploying Hadoop in Ubuntu with my own adaptation for CentOS.

Installation Pre-requisites:

I built the whole environment on CentOS 6 (i386) as a virtual machine. I chose to create a virtual machine template to keep the process of adding new nodes to the cluster simple. I chose the i386 version for two reasons: 1) I was only giving my virtual machines 2GB of RAM, and 2) the pre-compiled Apache Hadoop 2.2.0 distribution ships with 32-bit native binaries. Yes, I could have compiled from source for 64-bit, but I was trying to keep it simple.

Next are some of the preparation steps I took in the CentOS virtual machine build.

  • Pre-configure DNS resolution for all of the nodes that would be added to the cluster
  • Disable IPv6
/etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
  • Edit /etc/hosts and remove the ::1 entry
  • Reboot the server and check if IPv6 is disabled
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
  • Install and configure Java. Download the latest Java installation and copy it to /var/tmp
cd /var/tmp
tar xvfz jdk-7u45-linux-i586.tar.gz -C /opt

# edit /root/.bashrc
export JAVA_HOME=/opt/jdk1.7.0_45
export JRE_HOME=/opt/jdk1.7.0_45/jre
export PATH=$PATH:/opt/jdk1.7.0_45/bin:/opt/jdk1.7.0_45/jre/bin

# set the path link:
# alternatives --install [link] [name] [path] [priority]
[root@hadoop01 bin]# alternatives --install /usr/bin/java java /opt/jdk1.7.0_45/bin/java 2
[root@hadoop01 bin]# alternatives --config java
There is 1 program that provides 'java'.
 Selection    Command
-----------------------------------------------
*+ 1           /opt/jdk1.7.0_45/bin/java
Enter to keep the current selection[+], or type selection number: 1

# check the Java version: java -version
[root@hadoop01 ~]# java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) Client VM (build 24.45-b08, mixed mode)

Hadoop Configuration and Setup:

Now it’s time to actually install and configure the Hadoop components:

  • Create a Hadoop user account and group as root
groupadd hadoop
useradd -g hadoop hadoop
id hadoop
  • Add the hadoop user to /etc/sudoers (edit the file with visudo)
hadoop    ALL=(ALL)       ALL
  • Configure key-based login via SSH as the hadoop user
su - hadoop
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

[hadoop@hadoop01 ~]$ ssh-keygen -t rsa 
Generating public/private rsa key pair. 
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
Created directory '/home/hadoop/.ssh'. 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa. 
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub. 
The key fingerprint is: c6:97:3a:39:a7:66:8f:9a:9b:bc:f1:0a:8c:29:b4:63 hadoop@hadoop01.nycstorm.lab 
The key's randomart image is: 
+--[ RSA 2048]----+ 
|                 | 
|                 | 
|                 | 
|       .   .     | 
| .      S o      | 
|. .+   . +       | 
|.Eo o . = .      | 
|...  o =o*       | 
|      OB+..      | 
+-----------------+ 
[hadoop@hadoop01 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
[hadoop@hadoop01 ~]$ chmod 0600 ~/.ssh/authorized_keys
  • Update the ~/.bashrc script with all the necessary environment variables for the hadoop user:
export JAVA_HOME=/opt/jdk1.7.0_45
export JRE_HOME=/opt/jdk1.7.0_45/jre
export PATH=$PATH:/opt/jdk1.7.0_45/bin:/opt/jdk1.7.0_45/jre/bin
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
  • Optionally, export the following setting as well. It isn’t necessary here since we’ve already disabled IPv6, but if IPv6 is left enabled it forces Hadoop to prefer IPv4.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
  • Download the Hadoop files
yum install wget
cd /var/tmp
wget http://apache.mesi.com.ar/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
tar xvfz hadoop-2.2.0.tar.gz -C /opt
cd /opt
chown -R hadoop:hadoop hadoop-2.2.0
ln -s hadoop-2.2.0 hadoop
  • Update the Hadoop configuration files in the /opt/hadoop/etc/hadoop directory
# core-site.xml
   <property>
     <name>fs.default.name</name>
     <value>hdfs://hadoop01:9000</value>
   </property>

# hdfs-site.xml 
# dfs.replication is the number of replicas of each block 
# dfs.name.dir is the path on the local fs where namenode stores the namespace and transactions persistently 
# dfs.data.dir is the comma-separated list of paths on the local fs of the datanode where it stores its blocks
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///opt/hadoop/data/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///opt/hadoop/data/dfs/data</value>
  </property>

# run the following command to copy from the template to the actual file that we will edit
cp mapred-site.xml.template mapred-site.xml

# mapred-site.xml
# mapred.job.tracker is the jobtracker host:port
# mapred.system.dir is where MapReduce stores system/control files
# mapred.local.dir is where MapReduce stores temp/intermediate files
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop01:9001</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/opt/hadoop/data/mapred/system/</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/opt/hadoop/data/mapred/local/</value>
  </property>

# yarn-site.xml
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop01</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop01:8032</value>
  </property>

# slaves
# delete localhost and add the names of all the slave nodes (datanodes/nodemanagers), one per line; for now just:
hadoop01
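
As a quick sanity check that the edited configuration files are being picked up, the hdfs getconf command can be queried (run as the hadoop user once the environment variables above are in place); the values returned should match the settings above:

hdfs getconf -confKey fs.default.name    # should return hdfs://hadoop01:9000
hdfs getconf -confKey dfs.replication    # should return 2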

Creating the clone:

At this point, the node is ready to be turned into a virtual machine template for cloning. First, though, the OS needs to be prepared for cloning. Perform the following steps (adapted from the lonesysadmin blog), then shut the virtual machine down.

/usr/bin/yum clean all
/bin/cat /dev/null > /var/log/audit/audit.log
/bin/cat /dev/null > /var/log/wtmp
/bin/rm -f /etc/udev/rules.d/70*
/bin/sed -i '/^\(HWADDR\|UUID\)=/d' /etc/sysconfig/network-scripts/ifcfg-eth0
/bin/rm -f /etc/ssh/*key*
/bin/rm -f /home/hadoop/.ssh/*
/bin/rm -Rf /tmp/*
/bin/rm -Rf /var/tmp/*
/bin/rm -f ~root/.bash_history
unset HISTFILE

However, each time a new clone/node is added to the cluster, a few settings need to be updated. I’ve created a script to initialize a newly cloned node. The script should be created in the hadoop user’s home directory and run as the hadoop user with the following syntax: ~/initialize.sh <fqdn> <ipaddress>

#!/bin/bash
# initialize.sh - run on a newly cloned node as the hadoop user
HOSTNAME=$1
IPADDR=$2
if [ -z "$HOSTNAME" ] || [ -z "$IPADDR" ]
then
  echo "usage: initialize.sh hostname ipaddr"
  exit 1
fi
# replace the template hostname/IP with this node's values
sudo /bin/sed -i "s/HOSTNAME=hadoop.nycstorm.lab/HOSTNAME=$HOSTNAME/" /etc/sysconfig/network
sudo /bin/sed -i "s/IPADDR=192.168.11.49/IPADDR=$IPADDR/" /etc/sysconfig/network-scripts/ifcfg-eth0
grep HOSTNAME /etc/sysconfig/network
grep IPADDR /etc/sysconfig/network-scripts/ifcfg-eth0
echo
echo "######################################################################"
echo "# NOTICE: $HOSTNAME needs a reboot now for hostname -f to take effect."
echo "#"
echo "# SERVICE: hadoop-daemon.sh start datanode"
echo "# SERVICE: yarn-daemon.sh start nodemanager"
echo "#"
echo "######################################################################"
echo

Also, some updates need to be made to the master node each time a node is added. I’ve created the following script to handle that:

#!/bin/bash
ADDNODE=$1
if [ -z "$ADDNODE" ]
then
  echo usage: addnode.sh hostname
  exit 1
fi
echo $ADDNODE >> /opt/hadoop/etc/hadoop/slaves
scp ~/.ssh/id_rsa.pub hadoop@$ADDNODE:/home/hadoop/id_rsa.pub.hadoop01
ssh hadoop@$ADDNODE "cat id_rsa.pub.hadoop01 >> .ssh/authorized_keys; chmod 644 .ssh/authorized_keys"
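
To illustrate how the two scripts fit together, adding a node looks roughly like this (the hostname and IP address below are hypothetical, following the naming used in this lab):

# on the freshly cloned node, as the hadoop user
~/initialize.sh hadoop11.nycstorm.lab 192.168.11.60
sudo reboot
# after the reboot, start the Hadoop services on the new node
hadoop-daemon.sh start datanode
yarn-daemon.sh start nodemanager

# on the master node (hadoop01), as the hadoop user
~/addnode.sh hadoop11.nycstorm.lab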

Starting Services:

The following are a few commands for starting overall cluster services and for checking status:

# format the hdfs filesystem
/opt/hadoop/bin/hdfs namenode -format
# start the hdfs services
/opt/hadoop/sbin/start-dfs.sh
# start the tasktracker services
/opt/hadoop/sbin/start-yarn.sh
# check the status of all services
jps
# if all starts properly, you will see the following:
 2583 DataNode
 2970 ResourceManager
 3461 Jps
 3177 NodeManager
 2361 NameNode
 2840 SecondaryNameNode

Note that on each individual data node, you can start the datanode and nodemanager services via the following commands:

hadoop-daemon.sh start datanode
yarn-daemon.sh start nodemanager

From the master node, you can check the status of all nodes:

[hadoop@hadoop01 ~]$ yarn node -list
13/11/18 13:07:44 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.11.50:8032
Total Nodes:10
Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
hadoop04.nycstorm.lab:52906     RUNNING hadoop04.nycstorm.lab:8042     0
hadoop03.nycstorm.lab:44443     RUNNING hadoop03.nycstorm.lab:8042     0
hadoop08.nycstorm.lab:42321     RUNNING hadoop08.nycstorm.lab:8042     0
hadoop10.nycstorm.lab:53675     RUNNING hadoop10.nycstorm.lab:8042     0
hadoop07.nycstorm.lab:33923     RUNNING hadoop07.nycstorm.lab:8042     0
hadoop01.nycstorm.lab:48101     RUNNING hadoop01.nycstorm.lab:8042     0
hadoop02.nycstorm.lab:60853     RUNNING hadoop02.nycstorm.lab:8042     0
hadoop05.nycstorm.lab:39854     RUNNING hadoop05.nycstorm.lab:8042     0
hadoop09.nycstorm.lab:45020     RUNNING hadoop09.nycstorm.lab:8042     0
hadoop06.nycstorm.lab:35679     RUNNING hadoop06.nycstorm.lab:8042     0

You can then upload some files and take a look at the status of your HDFS filesystem.

[hadoop@hadoop01 data]$ hdfs dfs -ls /data
Found 1 items
-rw-r--r-- 2 hadoop supergroup 284806 2013-11-26 13:02 /data/pg16_peter_pan.txt
[hadoop@hadoop01 data]$ hdfs dfs -copyFromLocal * /data
copyFromLocal: `/data/pg16_peter_pan.txt': File exists
[hadoop@hadoop01 data]$ hdfs dfs -ls /data
Found 14 items
-rw-r--r-- 2 hadoop supergroup 167517 2013-11-26 15:25 /data/pg11_alice_in_wonderland.txt
-rw-r--r-- 2 hadoop supergroup 3322651 2013-11-26 15:25 /data/pg135_les_miserables.txt
-rw-r--r-- 2 hadoop supergroup 284806 2013-11-26 13:02 /data/pg16_peter_pan.txt
-rw-r--r-- 2 hadoop supergroup 1257274 2013-11-26 15:25 /data/pg2701_moby_dick.txt
-rw-r--r-- 2 hadoop supergroup 90701 2013-11-26 15:25 /data/pg41_sleepy_hollow.txt
-rw-r--r-- 2 hadoop supergroup 1573150 2013-11-26 15:25 /data/pg4300_ulysses.txt
-rw-r--r-- 2 hadoop supergroup 181997 2013-11-26 15:25 /data/pg46_a_christmas_carol.txt
-rw-r--r-- 2 hadoop supergroup 1423803 2013-11-26 15:25 /data/pg5000_notes_of_leonardo_davinci.txt
-rw-r--r-- 2 hadoop supergroup 141419 2013-11-26 15:25 /data/pg5200_metamorphosis.txt
-rw-r--r-- 2 hadoop supergroup 421884 2013-11-26 15:25 /data/pg74_adventures_of_tom_sawyer.txt
-rw-r--r-- 2 hadoop supergroup 610157 2013-11-26 15:25 /data/pg76_adventures_of_huckleberry_finn.txt
-rw-r--r-- 2 hadoop supergroup 142382 2013-11-26 15:25 /data/pg844_the_importance_of_being_earnest.txt
-rw-r--r-- 2 hadoop supergroup 448689 2013-11-26 15:25 /data/pg84_frankenstein.txt
-rw-r--r-- 2 hadoop supergroup 641414 2013-11-26 15:25 /data/pg8800_the_divine_comedy.txt
[hadoop@hadoop01 data]$
[hadoop@hadoop01 data]$ hdfs dfsadmin -report
Configured Capacity: 211378749440 (196.86 GB)
Present Capacity: 195689984139 (182.25 GB)
DFS Remaining: 195668267008 (182.23 GB)
DFS Used: 21717131 (20.71 MB)
DFS Used%: 0.01%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

Running on Hadoop

I then wrote a mapper.pl and reducer.pl script to do the simple word count example and ran that against the files that I uploaded to HDFS. With those Perl files, I then used the streaming API to run a Hadoop 2.0 job.

hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
-mapper /home/hadoop/src/mapper.pl \
-reducer /home/hadoop/src/reducer.pl \
-input /data/pg16_peter_pan.txt -output /output/pg16_peter_pan

Below is the Perl code that I used for the mapper.pl and reducer.pl scripts. It can be adapted to Python or anything else. You can also test the code locally, outside of Hadoop, by running: cat <dir_with_all_files>/* | ./mapper.pl | ./reducer.pl

[hadoop@hadoop01 src]$ cat mapper.pl
#!/usr/bin/perl
# mapper: read lines from STDIN, strip punctuation, emit "<word>\t1" for every word
mapper();
sub mapper {
  foreach my $line (<STDIN>) {
    chomp($line);
    $line =~ s/[.,;:?!"()\[\]]//g;
    $line =~ s/--/ /g;
    my @words = split(/\s+/, $line);
    foreach my $word (@words) {
      print "$word\t1\n";
    }
  }
}
[hadoop@hadoop01 src]$ cat reducer.pl
#!/usr/bin/perl
# reducer: read "<word>\t<count>" pairs from STDIN and sum the counts per word
reducer();
sub reducer {
  my %hash;
  foreach my $line (<STDIN>) {
    chomp($line);
    my ($key,$value) = split(/\t/, $line);
    if (defined $hash{$key}) {
      $hash{$key} += $value;
    } else {
      $hash{$key} = $value;
    }
  }
  foreach my $key (keys %hash) {
    print "$key\t$hash{$key}\n";
  }
}
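
Once the streaming job completes, the word counts land in HDFS under the -output directory as part-* files. A quick way to peek at the top results (assuming the default part-00000 naming and the output path from the job above):

hdfs dfs -ls /output/pg16_peter_pan
hdfs dfs -cat /output/pg16_peter_pan/part-00000 | sort -k2 -nr | head -20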

Status

There are a few web consoles for checking on the status of your Hadoop cluster, such as the NameNode and ResourceManager web UIs.

Troubleshooting

In the process of building the cluster, I ran into a number of issues. Below are some of those issues and how I resolved them.

The first issue I ran into was connectivity between the master node and the data nodes.

2013-11-15 17:24:18,463 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: hadoop01/192.168.11.50:9000
2013-11-15 17:24:24,465 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop01/192.168.11.50:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

The cause was the iptables firewall on the master node blocking access from the data nodes. I didn’t spend the time to add the proper firewall rules in; instead I chose to either disable iptables or delete all the firewall rules. Not secure, but again, this exercise was for the purpose of learning Hadoop, not deploying a production cluster.

iptables --list
iptables --flush              # deletes all rules
/etc/init.d/iptables stop     # stops the iptables service
chkconfig iptables off        # disables iptables from starting on boot

I also had issues writing to HDFS due to SELinux. Unfortunately I didn’t capture the log entry for that error, but it was a write failure to HDFS. I ran the following to rectify the issue.

sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# force stop selinux without a reboot, alternatively just reboot
setenforce 0

As part of troubleshooting the HDFS write issues above, I ended up reformatting HDFS, which caused some issues of its own. If you ever reformat HDFS, you also need to delete the contents of the dfs.data.dir directories on the datanodes; otherwise you will see the following incompatible clusterIDs error messages in the datanode logs (a cleanup sketch follows the log excerpt):

0831241-192.168.11.50-50010-1385486041683) service to hadoop01/192.168.11.50:9000
java.io.IOException: Incompatible clusterIDs in /opt/hadoop-2.2.0/data/dfs/data: namenode clusterID = CID-8b249a29-681f-4417-a464-a849d3a9cc9c; datanode clusterID = CID-472489cb-19b3-4381-8572-c9bf7bf5db64
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:744)
2013-11-26 12:19:36,471 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1806045400-192.168.11.50-1385486221792 (storage id DS-1520831241-192.168.11.50-50010-1385486041683) service to hadoop01/192.168.11.50:9000
2013-11-26 12:19:36,584 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1806045400-192.168.11.50-1385486221792 (storage id DS-1520831241-192.168.11.50-50010-1385486041683)
2013-11-26 12:19:38,584 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2013-11-26 12:19:38,586 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2013-11-26 12:19:38,596 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at hadoop01.nycstorm.lab/192.168.11.50
************************************************************/
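
For completeness, here is a rough sketch of the cleanup I ended up doing, assuming the dfs.name.dir and dfs.data.dir paths configured in hdfs-site.xml earlier (the data directory wipe needs to run on every datanode):

# stop HDFS, wipe the old metadata/block directories, reformat, and restart
stop-dfs.sh
rm -rf /opt/hadoop/data/dfs/name/*    # on the namenode
rm -rf /opt/hadoop/data/dfs/data/*    # on every datanode
hdfs namenode -format
start-dfs.sh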

That’s all. At this point you should have a running Hadoop cluster and have run a job against it.

PowerPath/Migration Enabler

Introduction:
Performing data migrations is something every systems and storage administrator has had to deal with at some juncture in his/her career. Many migration techniques require downtime for the application. Block-level techniques like SAN Copy (CLARiiON/VNX) or Open Replicator (Symmetrix) require bringing the application down in order to perform the cutover. There are other techniques that do not require downtime. Some examples of this are host-based volume manager or filter driver techniques – Veritas Volume Manager (VxVM), PowerPath/Migration Enabler (PP/ME), or something like Federated Live Migration between two Symmetrix arrays (uses both PowerPath and Open Replicator).

In this blog, I’ll cover some basics and gotchas for PP/ME.

PowerPath Migration Enabler takes advantage of the multi-pathing capabilities within PowerPath. PowerPath functions as a filter driver within the I/O stack on a server. When an application writes down to a volume, it will take the following path down to storage:

  • File system (Windows drive letter or Unix/Linux mounted file system) or Database raw partition
  • Logical volume manager
  • PowerPath (pseudo devices)
  • SCSI driver (native devices)
  • SCSI controller or HBA
  • Storage controller via SAN or direct connection

For a Symmetrix volume, you will typically have two paths for a single device. The server will then have a native device for each of those two paths. From a server perspective, it looks like two different physical devices, but we know that they represent the same device. PowerPath then creates a pseudo device that then encapsulates those native devices as a single device. Note: on a Unix/Linux server you could technically still address a native device, but PowerPath will still intercept the I/O and intelligently load balance as is appropriate.

Why the explanation on PowerPath native and pseudo devices? It’s because PP/ME takes advantage of its position in the I/O stack to copy data to new devices in the background, by copying/mirroring data from one native device to a new native device (similar to VxVM plex mirroring). That activity is transparent to the file system above it. This is why you can perform a migration/cutover to a new storage array while keeping the application online.

The rest of the blog will focus on specifics within a Windows environment, but the same concepts can be applied to Unix/Linux servers.

Pre-requisites:
Below are some of the requirements prior to migrating with PP/ME.

  • PowerPath 5.3.1 at minimum must be installed.
    • If you already have this version or later but did a typical install, you likely won’t have PP/ME installed.
    • To get PP/ME installed, just run the installer again, choose the custom installation option and then select the PP/ME option.
    • In the case where you already had PowerPath installed and are just getting the PP/ME feature added, a reboot is not required but it is recommended. I’ve seen some weird issues with the service without that reboot.
  • You will also require the HostCopy license for PP/ME. It can be obtained for free from EMC, assuming you already have the base PowerPath license and a current maintenance contract.
  • Once all of these are complete, you can verify that you have it by running the powermig command in Command Prompt.
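
For example, opening an elevated Command Prompt and running powermig with no arguments (or powermig help, if available in your version) should return the list of available migration commands; if the command is not recognized, the PP/ME feature was not installed. This is just a sanity check, not an exhaustive validation.

powermig help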

Setting up the Source/Target Mapping:
Next you need to know which source LUN maps to which target LUN. The easiest way to determine this is via the PowerPath Console. In the illustration below, I’ve rearranged the columns to make it easier to see. You’ll also note that the 3rd device shows as failed. That was an old CX4 LUN that was reclaimed but not yet cleaned up, which at least let me show a view of both Symmetrix and CLARiiON LUNs in PowerPath.

Disk Number – correlates to the physical drive number in Windows Disk Management
Device – for Symmetrix devices this shows the symdev; for CLARiiON/VNX devices it shows the LUN UID
LUN Name – blank for Symmetrix devices; for CLARiiON/VNX devices it shows the name of the LUN. If you include the ALU in the LUN name, the mapping becomes easy.
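
If you prefer a command line over the console, the same pseudo-to-native device mapping can also be pulled from PowerPath's powermt utility, for example:

powermt display dev=all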

Host Resources and Throttling:
During the synchronization process, host resources will be used to perform the copy. Depending on what is going on with the server, you may want to throttle the host resources that PP/ME is allowed to use. Below are some of the settings that you can use. The percentage represents the percentage of time the host spends copying data.

  • 0: 100%
  • 1: 60%
  • 2: 36% (default)
  • 3: 22%
  • 4: 13%
  • 5: 7.8%
  • 6: 4.7%
  • 7: 2.8%
  • 8: 1.7%
  • 9: 1.0%

Migration Process:
And now the process. As stated earlier, the migration process does not require bringing the application down. While the migration should still be done in a maintenance window with appropriate notice to the user/business community, the application will take no downtime.

  • Present the new target storage
  • Set up the PP/ME sessions using the mapping technique above
  • Perform initial synchronization prior to the migrations with the appropriate throttle settings
  • During a maintenance window:
    • Switch targets to the new storage and commit the changes
    • Remove the old storage
    • Perform UAT

Below are commands required for setting up, executing and cleaning up PP/ME.

# create a session, will return a handle #
powermig setup -src harddiskXX -tgt harddiskYY -techType hostcopy

# start the background copy, will enter sourceSelected state when complete
powermig sync -handle 1

# monitor status of the session
powermig query -handle 1

# set the throttle per the settings stated above
powermig throttle -throttleValue 0 -handle 1

# switch over to the new storage, still mirrors back to the original
# will enter targetSelected state
powermig selectTarget -handle 1

# stop mirroring to the original, will enter committed state
powermig commit -handle 1

# cleanup/delete the session
powermig cleanup -handle 1

Gotchas:
Below are some gotchas that I’ve seen in my experience with PP/ME:

  • For Windows servers, you need to make sure that the syntax of the source and target devices is “harddiskXX”, where XX maps to the PHYSICALDRIVE number found in PowerPath or Disk Management. The symdev, LUN name, LUN ID, or LUN UID will not work.
  • The synchronization time is largely dependent on host resources and the underlying storage. You could set throttle to 0 (allowing 100% host resources), but if there is a lot going on the box, PP/ME will be competing with other activity to perform the background copies. Hence, you need to find appropriate times to perform the initial synchronization.
  • PP/ME is supported with MSCS clusters. However, it is critical that no resource failovers happen at any point during a PP/ME session. Why? Because once the new LUN is mirrored, it not only has the same data on it, it also has the same disk signature. In the event of a cluster failover, the secondary node will not know about the PP/ME source/target relationship. Therefore, the cluster node will think that the server has two different LUNs with the same disk signature. Because the target LUN is write-disabled (locked by PP/ME on the other node), Windows will then re-signature the source LUN. With a new signature on the source LUN, the cluster will no longer recognize that LUN in the cluster, and therefore the cluster will fail to come back online. You will need to use tools like diskmap and dumpcfg to manually resignature the source LUN back to the original disk signature.

RecoverPoint Initial Synchronization with DD

The Downlow
While some companies may be equipped with an abundance of bandwidth between their production and disaster recovery sites, many others are limited with their site-to-site bandwidth. As such, many implement data replication technologies that also perform data compression, de-duplication, and even fast-write capabilities in IP and fibre channel protocols.

In this particular case study, I’m working with a customer with two data centers, New York and Washington DC, with a 50 Mb/s line between them. EMC RecoverPoint is the replication technology of choice, and the customer is doing bi-directional replication. The Washington DC site has about 4TB of data that needs to replicate to New York, and the New York site has about 10TB of data that needs to replicate to Washington DC.

In a perfect world (no latency, no packet loss, 100% utilization of the link), it would take roughly 7 days to replicate the 4TB and roughly 18 days to replicate the 10TB. That’s almost a month to move the data with the link fully saturated. Unfortunately, that link was also used for other business uses, e.g. VoIP traffic, internal application traffic, server monitoring traffic, etc. Thus the CIO mandated that we find another method to perform the initial synchronization, as using the link (even throttled) was not an option for this duration.
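
For anyone who wants to reproduce the back-of-the-envelope math, the ideal-case transfer time is just data size divided by link rate (treating 1TB as 10^12 bytes and ignoring protocol overhead), for example:

# 4TB and 10TB over a fully saturated 50 Mb/s link, expressed in days
echo "scale=1; (4 * 8 * 10^6) / 50 / 86400" | bc    # ~7.4 days
echo "scale=1; (10 * 8 * 10^6) / 50 / 86400" | bc   # ~18.5 days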

The Approach
The EMC RecoverPoint Release 3.4 Administrator Guide (P/N 300-012-256) documents a method for performing first-time initialization from backup. The primary kicker here though, is that the backup must be a block-level backup, not a file-level backup. This is because the target RecoverPoint image will be seeded with that block-level backup and then RecoverPoint will perform a full volume sweep to synchronize the incremental changes since the block-level backup.

Most companies, however, do not perform block-level backups of their servers. Rather, they perform file-level backups, which then get catalogued for easy restores. Below is a summary of the process I used to perform the RecoverPoint initial synchronization using dd as the block-level backup.

Pre-requisites/Setup

  • Downloaded dd on a Windows utility server
    • http://www.chrysocome.net/downloads/dd-0.6beta3.zip
    • This is the tool we will use for the block-level backups.
      dd if=[vol_source] of=[vol_target] bs=512k
    • Note: I did some very rudimentary performance tests to see what block size would be optimal for these backups. I found 512k to be the sweet spot.
  • Configured clones for all volumes that will be seeded with RecoverPoint. The main reason for this is twofold:
    • I didn’t want to impact the performance of the production volume while dd reads from the source volume to create the backup.
    • dd cannot operate against volumes with open files. Thus, we’d need to bring down the applications for the duration of the dd backup. When performing a dd against a mounted clone and against the PHYSICALDRIVE address, I did not get open file errors. Below is an example of the errors you will see with dd if there are open files.
      C:\Utilities>dd if=\\.\H: of=z:\testvolume.img bs=512k
      rawwrite dd for windows version 0.6beta3.
      Written by John Newbigin <jn@it.swin.edu.au>
      This program is covered by terms of the GPL Version 2.
      Error opening input file: 32 The process cannot access the file because it is being used by another process
    • The source volumes were on 15k drives and the target volumes were on 7.2k SATA drives. I was able to copy roughly 5 GB/min (+/-0.5) with this process.

The Process 

  1. For new source volumes, confirm that all the data has first been migrated before proceeding.
  2. Configure the consistency group(s) for the volumes in scope
    • When finishing the consistency group, do not start the transfer. Leave the transfer paused.
  3. Right-click the consistency group, select “Clear Markers”
    • This tells RP that the remote copy is known to be identical to its corresponding production volume, so a full volume sweep is not required.
    • When the dialog box pops up, select both copies.
    • Note: I had to do this via the command line because the GUI was only letting me clear the markers at the DR location. The command line without the copy=XYZ option clears all markers on both sides.
      clear_markers group=RPSyncTest
  4. Create the block-level copy with dd
  5. Transfer the copy to the secondary site. In our case, we shipped the USB drives to the secondary site.
  6. Enable image access on the secondary volume
    • Select the latest image
    • After access goes to logged access, enable direct access
  7. Restore the backup to the secondary volume
    • Remember, you already cleared the markers before taking the first dd copy. If you clear them again, it will throw off the tracking of where replication should resume.
    • No need to give the drive a drive letter or format it, as you can access it via the \\.\PHYSICALDRIVE2 address.
  8. Disable image access and start the transfer
    • Check the “Start data transfer immediately” checkbox to resume replication
    • Monitor the consistency group. The traffic you see will be the changes to the source volume since the block-level dd copy was made. The duration should be significantly less than if it was a full copy, depending on how much data has changed since the original dd backup.

The Results
Below are some of the results from the initial synchronization process. Note that between the dd on the source and the reverse dd on the secondary volume, roughly two days elapsed.

  • 330GB consistency group 1
    • At 50 Mb/s, it would have taken roughly 15 hours to perform a full sync.
    • Initial synchronization took 58 minutes, transferring roughly 21GB (6.36%).
    • We saved roughly 14 hours and 309GB of transfer.
  • 330GB consistency group 2
    • Initial synchronization took 43 minutes, transferring roughly 10GB (3.03%).
    • We saved a little over 14 hours and 320GB of transfer.
  • 330GB consistency group 3
    • Initial synchronization took 50 minutes, transferring roughly 18GB (5.45%).
    • We saved a little over 14 hours and 312GB of transfer.
  • 330GB consistency group 4
    • Initial synchronization took 53 minutes, transferring roughly 19GB (5.76%).
    • We saved a little over 14 hours and 311GB of transfer.
I would post the initialization graphs of the above consistency groups, but the window for the graphs is 5 minutes and would just show constant transfer. Instead, below are graphs of a 1GB test volume that I configured.
This is a graph of the initialization traffic without the data pre-seeded. Note that the green line for site-to-site traffic hovers between 35-50 Mb/s for almost 2 minutes.
This is a graph of the initialization traffic with the data pre-seeded. Note that the green line spikes only briefly for the volume sweep, lasting about 30 seconds.

The Commands

plink -l admin -pw admin 192.168.10.10 "enable_group group=RPSyncGroup start_transfer=no"
plink -l admin -pw admin 192.168.10.10 "clear_markers group=RPSyncGroup"
dd if=\\.\[PHYSICALDRIVE##] of=z:\[PHYSICALDRIVE##.img] bs=512k
[transfer the images to the secondary site via USB drive]
plink -l admin -pw admin 192.168.10.10 "enable_image_access group=RPSyncGroup copy=DR_RPSyncGroup image=latest"
plink -l admin -pw admin 192.168.10.10 "set_image_access_mode group=RPSyncGroup copy=DR_RPSyncGroup mode=direct"
dd if=z:\[PHYSICALDRIVE##.img] of=\\.\[PHYSICALDRIVE##] bs=512k
plink -l admin -pw admin 192.168.10.10 "disable_image_access group=RPSyncGroup copy=DR_RPSyncGroup start_transfer=no"
plink -l admin -pw admin 192.168.10.10 "start_transfer group=RPSyncGroup"
[monitor initial synchronization traffic]

Apache on CentOS 6.2 with Sub-directories

I built a CentOS 6.2 virtual machine on my VMware Workstation as a utility server (192.168.1.135). I used the CentOS-6.2-i386-minimal.iso to do the install and then installed a LAMP stack on it. After that, the next step was to get phpMyAdmin to manage the MySQL database. I did the following to do so:

1. Downloaded the latest package from http://www.phpmyadmin.net/home_page/downloads.php onto my laptop (192.168.1.119).
2. Used WinSCP to copy the file to my home directory.
3. Logged in and sudo’ed to root.
4. Copied the file from my home directory to /var/www/html, untarred the package, and renamed the directory to phpmyadmin.
5. I then went to access the server at http://192.168.1.135/phpmyadmin and encountered the following 403 error.

Forbidden
You don’t have permission to access /phpmyadmin on this server.

The error logs (/var/log/httpd/error_log) showed the following:

[Thu Apr 19 06:28:22 2012] [error] [client 192.168.1.119] (13)Permission denied: access to /phpmyadmin/ denied

I then sought the counsel of Google. Many web sites talk about either permissions on the directory/files or the httpd.conf configuration. My issue was neither of those. It had to do with SELinux, which comes enabled by default in the minimal CentOS 6.2 install.

[root@sandbox conf]# yum list | grep selinux
libselinux.i686 2.0.94-5.2.el6 @anaconda-CentOS-201112130233.i386/6.2
libselinux-utils.i686 2.0.94-5.2.el6 @anaconda-CentOS-201112130233.i386/6.2
selinux-policy.noarch 3.7.19-126.el6 @anaconda-CentOS-201112130233.i386/6.2
selinux-policy-targeted.noarch 3.7.19-126.el6 @anaconda-CentOS-201112130233.i386/6.2
ipa-server-selinux.i686 2.1.3-9.el6 base
libselinux-devel.i686 2.0.94-5.2.el6 base
libselinux-python.i686 2.0.94-5.2.el6 base
libselinux-ruby.i686 2.0.94-5.2.el6 base
libselinux-static.i686 2.0.94-5.2.el6 base
pki-selinux.noarch 9.0.3-21.el6_2 updates
selinux-policy.noarch 3.7.19-126.el6_2.10 updates
selinux-policy-doc.noarch 3.7.19-126.el6_2.10 updates
selinux-policy-minimum.noarch 3.7.19-126.el6_2.10 updates
selinux-policy-mls.noarch 3.7.19-126.el6_2.10 updates
selinux-policy-targeted.noarch 3.7.19-126.el6_2.10 updates

The problem was that the phpmyadmin directory I copied over via WinSCP inherited the wrong SELinux context (user_tmp_t), which meant Apache did not have the appropriate permissions to serve it.

[root@sandbox html]# ls -Z
-rw-r--r--. root root unconfined_u:object_r:httpd_sys_content_t:s0 info.php
drwxr-xr-x. root root unconfined_u:object_r:user_tmp_t:s0 phpmyadmin

To fix this, I needed to do the following:

chcon -R -t httpd_sys_content_t phpmyadmin

Note: be sure to use the -R to recursively apply that context against all files. Otherwise you will get a server misconfiguration error.

[root@sandbox html]# ls -Z
-rw-r--r--. root root unconfined_u:object_r:httpd_sys_content_t:s0 info.php
drwxr-xr-x. root root unconfined_u:object_r:httpd_sys_content_t:s0 phpmyadmin

In retrospect, had I downloaded the file via wget directly into the /var/www/html directory, it would have already taken the proper context, and I would not have had the issue.
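
As a footnote, chcon sets the context directly on the files; an alternative is restorecon, which resets the directory to the default context defined by the SELinux policy (and achieves the same result for content under /var/www/html):

restorecon -Rv /var/www/html/phpmyadmin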