Saturday, December 29, 2012

Test Hadoop cluster on VMware

SQL Server MVP Jeremiah Peschka posted 2 articles about Hadoop, which got me interested in NoSQL.

I don't have much knowledge of NoSQL or Linux, so I am going to set up a test environment on my laptop over the holidays.

1. download CentOS Linux setup iso file
http://www.centos.org/

2. download java jdk 1.6
http://www.oracle.com/technetwork/java/javase/downloads/index.html

3. download hadoop setup file
http://hadoop.apache.org/#Download+Hadoop

I downloaded release 1.0.4

4. Create VM with VMware workstation
I created 3 VMs:
linux1 : 192.168.27.29   -----> master
linux2 : 192.168.27.31   -----> slave
linux3 : 192.168.27.32   -----> slave


5. install Linux OS

6. Configure the VM IP addresses
vi /etc/sysconfig/network-scripts/ifcfg-eth0
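As a rough sketch, the ifcfg-eth0 file on linux1 could look like the lines below (the netmask and gateway are assumptions for this test network; adjust them to your VMware network settings), then restart the network with "service network restart":

DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.27.29
NETMASK=255.255.255.0
GATEWAY=192.168.27.2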

7. Configure host name and hosts file
vi /etc/sysconfig/network          ---------> set the hostname
vi /etc/hosts                      ---------> add IP/hostname mappings for all 3 servers, for instance
192.168.27.29 linux1
192.168.27.31 linux2
192.168.27.32 linux3
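For reference, /etc/sysconfig/network on linux1 would then look roughly like this (the other two servers get their own hostnames):

NETWORKING=yes
HOSTNAME=linux1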

8. Install JDK
Copy the JDK install file to the VM with VMware shared folders and unpack it to a local folder. I installed the JDK in /usr/jdk1.6.0_37

9. Install Hadoop
Copy the install file to the VM with VMware shared folders and unpack it to a local folder. I installed the Hadoop files in /usr/hadoop-1.0.4
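A minimal sketch of the unpack step, assuming the downloaded archive is hadoop-1.0.4.tar.gz and the shared folder is mounted at /mnt/hgfs/share (use your own shared folder name):

cd /usr
tar -xzf /mnt/hgfs/share/hadoop-1.0.4.tar.gz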

10. Create folders for Hadoop
temp folder: /usr/hadoop-1.0.4/temp
Data folder: /usr/hadoopfiles/Data
Name folder: /usr/hadoopfiles/Name

Make sure the folder owner is the user that will start the Hadoop processes, and the permissions on the Data and Name folders should be 755, for instance
chmod 755 /usr/hadoopfiles/Data
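A sketch of creating the folders and setting ownership, assuming the cluster will be started by a user named "hadoop" (use whichever user you created):

mkdir -p /usr/hadoop-1.0.4/temp /usr/hadoopfiles/Data /usr/hadoopfiles/Name
chown -R hadoop:hadoop /usr/hadoop-1.0.4 /usr/hadoopfiles
chmod 755 /usr/hadoopfiles/Data /usr/hadoopfiles/Name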

11. Set environment variables
vi /etc/profile

then add the lines below:

HADOOP_HOME=/usr/hadoop-1.0.4
JAVA_HOME=/usr/jdk1.6.0_37
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$CLASSPATH
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export JAVA_HOME
export HADOOP_HOME
export CLASSPATH
export PATH
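To apply the new variables in the current shell and do a quick check (assuming the paths above):

source /etc/profile
java -version
hadoop version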

12. Set up SSH
1) Generate the ssh public key on all 3 servers
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

run "ssh localhost" to test if ssh works. make sure the authorized_keys file has correct permission, that's important
chmod 644 authorized_keys

2) Copy the file id_dsa.pub to the other 2 servers with a new file name. For instance,
on linux1, copy id_dsa.pub to linux2 and linux3 as linux1_id_dsa.pub.
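A sketch of that copy step from linux1 with scp (this assumes step 1 was already run on linux2 and linux3 so ~/.ssh exists there; it will still prompt for a password at this point):

scp ~/.ssh/id_dsa.pub linux2:~/.ssh/linux1_id_dsa.pub
scp ~/.ssh/id_dsa.pub linux3:~/.ssh/linux1_id_dsa.pub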

3) Log on to the other 2 servers and import the new file
cat ~/.ssh/linux1_id_dsa.pub >> ~/.ssh/authorized_keys

Do these 3 steps on all 3 servers and make sure you can ssh to any remote server without a password prompt.
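A quick way to verify from linux1 (repeat in both directions on all 3 servers):

ssh linux2 hostname
ssh linux3 hostname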


13. Configure Hadoop.
1) Open $HADOOP_HOME/conf/hadoop-env.sh and set the line below
export JAVA_HOME=/usr/jdk1.6.0_37

2) Open $HADOOP_HOME/conf/masters and add the line below
linux1


3) Open $HADOOP_HOME/conf/slaves and add the lines below
linux2
linux3


4) Edit $HADOOP_HOME/conf/core-site.xml

<configuration>
  <!-- global properties -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-1.0.4/temp</value>
    <!-- you can configure your own folder path for the tmp files here -->
    <description>A base for other temporary directories.</description>
  </property>
  <!-- file system properties -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://linux1:9000</value>
  </property>
</configuration>


5) Edit $HADOOP_HOME/conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/hadoopfiles/Name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/hadoopfiles/Data</value>
  </property>
</configuration>

6) Edit $HADOOP_HOME/conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>linux1:9001</value>
  </property>
</configuration>

Do the same configuration on all 3 servers.
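One way to do that from linux1, assuming Hadoop is unpacked to the same path on all servers, is to copy the conf folder over:

scp -r /usr/hadoop-1.0.4/conf linux2:/usr/hadoop-1.0.4/
scp -r /usr/hadoop-1.0.4/conf linux3:/usr/hadoop-1.0.4/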

14. Disable the firewall on all 3 servers
service iptables stop
chkconfig iptables off

15. Format the name node
cd /usr/hadoop-1.0.4/bin
./hadoop namenode -format

16. Start Hadoop on the master (linux1)
./start-all.sh

16) run "jps" on all 3 servers to check if hadoop is running
or you can open the website below
http://linux1:50030
http://linux1:50070

You can check the log files in the logs folder in case any process fails to start.
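With this layout, jps on linux1 should normally show NameNode, SecondaryNameNode, and JobTracker, while linux2 and linux3 should show DataNode and TaskTracker. As a quick smoke test once everything is up (the examples jar name below comes from the 1.0.4 distribution; adjust if yours differs):

hadoop fs -ls /
hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar pi 2 10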

It is a good start to learn Hadoop. Even Microsoft is developing data solutions with Hadoop on the Windows platform, so it is time to learn new things.

Reference:
http://blog.csdn.net/skyering/article/details/6457466