Saturday, December 29, 2012

Test Hadoop cluster on VMware

SQL Server MVP Jeremiah Peschka posted 2 articles about Hadoop, which got me interested in NoSQL.

I don't know much about NoSQL or Linux, so I am going to set up a test environment on my laptop over the holidays.

1. Download the CentOS Linux setup ISO file

2. Download Java JDK 1.6

3. Download the Hadoop setup file

I downloaded release 1.0.4

4. Create VMs with VMware Workstation
I created 3 VMs:
linux1 :   ----->master
linux2 :   ----->slave
linux3 :   ----->slave

5. Install Linux OS

6. Configure the VM IP address
vi /etc/sysconfig/network-scripts/ifcfg-eth0
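
For reference, a static-address ifcfg-eth0 on CentOS usually contains the keys shown in the sketch below. The address, netmask and gateway are placeholders I made up for illustration, not the values of my VMs, and the script writes to a scratch file instead of the real path so it is safe to run.

```shell
# Sketch: write a static-IP interface config to a scratch file for review.
# IPADDR, NETMASK and GATEWAY are hypothetical; use the values of your VMware network.
out=$(mktemp)
cat > "$out" <<'EOF'
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.1.101
NETMASK=255.255.255.0
GATEWAY=192.168.1.1
EOF
cat "$out"
# After putting these lines into /etc/sysconfig/network-scripts/ifcfg-eth0:
#   service network restart
```

Each VM gets its own IPADDR; the rest of the file can stay the same on all 3 servers.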

7. Configure host name and hosts file
vi /etc/sysconfig/network          --------->set the hostname
vi /etc/hosts                      --------->add an IP-to-hostname mapping for all 3 servers, for instance linux1 linux2 linux3
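
The hosts file needs one line per server. Here is a small sketch of the three mappings; the 192.168.1.x addresses are hypothetical placeholders, and the lines go to a scratch file so nothing on the system is changed.

```shell
# Sketch: build the three host mappings in a scratch file.
# The IP addresses are made up; replace them with the addresses of your VMs.
hosts_file=$(mktemp)
cat >> "$hosts_file" <<'EOF'
192.168.1.101 linux1
192.168.1.102 linux2
192.168.1.103 linux3
EOF
# Count the mappings that were written.
grep -c 'linux[123]$' "$hosts_file"
# On the VM the same lines are appended to /etc/hosts, and the host name
# is set in /etc/sysconfig/network with a line such as: HOSTNAME=linux1
```

Running "ping linux2" from linux1 afterwards is a quick way to confirm the mapping works.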

8. Install JDK
Copy the JDK install file to the VM through VMware shared folders and unpack it to a local folder. I installed the JDK in /usr/jdk1.6.0_37

9. Install Hadoop
Copy the install file to the VM through VMware shared folders and unpack it to a local folder. I installed the Hadoop files in /usr/hadoop-1.0.4
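
The unpack step can be sketched as a small helper. The function name unpack_to is my own, and the archive name and share path in the comment are assumptions: VMware shared folders normally appear under /mnt/hgfs, but the share name depends on your setup.

```shell
# Sketch: unpack a .tar.gz archive into a target directory.
unpack_to() {
    archive=$1   # path to the .tar.gz file
    dest=$2      # directory that will receive the unpacked folder
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# On the VM this might be (share name and archive name are assumptions):
#   unpack_to /mnt/hgfs/share/hadoop-1.0.4.tar.gz /usr
```

With the Hadoop 1.0.4 tarball this should leave the files in /usr/hadoop-1.0.4, the folder used in the rest of this post.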

10. Create folders for Hadoop
Temp folder: /usr/hadoop-1.0.4/temp
Data folder: /usr/hadoopfiles/Data
Name folder: /usr/hadoopfiles/Name

Make sure the folder owner is the user that will start the Hadoop processes. For the Data folder and Name folder, the permission should be 755:
chmod 755 /usr/hadoopfiles/Data
chmod 755 /usr/hadoopfiles/Name
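
The folder setup can be put in one helper so it is easy to repeat on each server. This is only a sketch: make_hadoop_dirs and its root argument are names I introduced so the function can be tried in a scratch directory first, and HADOOP_USER stands for whichever account starts Hadoop.

```shell
# Sketch: create the Hadoop working folders under a given root and set permissions.
make_hadoop_dirs() {
    root=$1    # "" on the real VM, a scratch directory when trying it out
    mkdir -p "$root/usr/hadoop-1.0.4/temp" \
             "$root/usr/hadoopfiles/Data" \
             "$root/usr/hadoopfiles/Name"
    chmod 755 "$root/usr/hadoopfiles/Data" "$root/usr/hadoopfiles/Name"
}

# Try it in a scratch directory and show the resulting permissions.
scratch=$(mktemp -d)
make_hadoop_dirs "$scratch"
ls -ld "$scratch/usr/hadoopfiles/Data" "$scratch/usr/hadoopfiles/Name"

# On the VM, as root (HADOOP_USER is a hypothetical account name):
#   make_hadoop_dirs ""
#   chown -R "$HADOOP_USER" /usr/hadoop-1.0.4/temp /usr/hadoopfiles
```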

11. Set environment variable
vi /etc/profile

Then add the lines below:

export JAVA_HOME
export PATH
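
As a sketch, the variable assignments that go with those export lines could look like this. JAVA_HOME is the same value used later in the Hadoop configuration step; adding HADOOP_HOME and the Hadoop bin folder to PATH is my assumption, based on the install folder above.

```shell
# Sketch of the /etc/profile additions.
# HADOOP_HOME is an assumption derived from the Hadoop install folder.
JAVA_HOME=/usr/jdk1.6.0_37
HADOOP_HOME=/usr/hadoop-1.0.4
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export JAVA_HOME HADOOP_HOME PATH
```

Run "source /etc/profile" (or log on again) to pick up the change. Hadoop 1.x prints a harmless "$HADOOP_HOME is deprecated" warning when that variable is set.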

12. Setup SSH
1) Generate an SSH public key file on all 3 servers
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/ >> ~/.ssh/authorized_keys

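Run "ssh localhost" to test whether SSH works. Make sure the authorized_keys file has the correct permission. That's important, because with the default StrictModes setting sshd ignores the file if group or others can write to it.
chmod 644 ~/.ssh/authorized_keys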

2) Copy the public key file to the other 2 servers under a new file name, for instance
on linux1, copy the key file to linux2 and linux3 under a name that identifies linux1

3) Log on to the other 2 servers and import the new file
cat ~/.ssh/ >> ~/.ssh/authorized_keys

Do these 3 steps on all 3 servers, and make sure you can ssh to any remote server without a password prompt.
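
The key exchange is easy to get wrong by hand, so here is a dry-run sketch that only prints the commands for one server. It assumes the key pair generated in step 1), which means the public key is ~/.ssh/id_dsa.pub; the remote file name id_dsa.pub.<host> and the helper peers_of are hypothetical choices of mine.

```shell
# Sketch: print the commands that push this host's public key to the other two servers.
HOSTS="linux1 linux2 linux3"

peers_of() {
    # Print every host in HOSTS except the one given as $1.
    for h in $HOSTS; do
        [ "$h" = "$1" ] || echo "$h"
    done
}

me=linux1
for peer in $(peers_of "$me"); do
    # Nothing is executed here; the commands are only printed for review.
    echo "scp ~/.ssh/id_dsa.pub $peer:~/.ssh/id_dsa.pub.$me"
    echo "ssh $peer 'cat ~/.ssh/id_dsa.pub.$me >> ~/.ssh/authorized_keys'"
done
```

Change me to linux2 and linux3 to get the commands for the other servers. To verify the result, "ssh -o BatchMode=yes linux2 true" fails instead of prompting when key login is not working.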

13. Configure Hadoop.
1) Open $HADOOP_HOME/conf/, set the line below
export JAVA_HOME=/usr/jdk1.6.0_37

2) Open $HADOOP_HOME/conf/masters, add line below

3) Open $HADOOP_HOME/conf/slaves, add line below

4) Edit $HADOOP_HOME/conf/core-site.xml

<!--- global properties -->
<description>A base for other temporary directories.</description>
<!-- file system properties -->

5) Edit $HADOOP_HOME/conf/hdfs-site.xml

<!--- global properties -->

6) Edit $HADOOP_HOME/conf/mapred-site.xml


Do the same configuration on all 3 servers.
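
To keep the configuration identical on all 3 servers, the files can be generated by a script. The sketch below is not a copy of my files: the property names are the standard Hadoop 1.x ones and the folders are those from step 10, but ports 9000 and 9001 and the replication factor 2 are assumptions, and write_hadoop_conf is a helper name I made up. It writes into a scratch folder so the output can be reviewed before copying it to $HADOOP_HOME/conf.

```shell
# Sketch: generate Hadoop 1.x config files in a given folder.
# Ports 9000/9001 and dfs.replication=2 are assumptions for a 2-slave test cluster.
write_hadoop_conf() {
    conf=$1
    mkdir -p "$conf"
    echo linux1 > "$conf/masters"                 # master host
    printf 'linux2\nlinux3\n' > "$conf/slaves"    # one slave host per line

    cat > "$conf/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-1.0.4/temp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://linux1:9000</value>
  </property>
</configuration>
EOF

    cat > "$conf/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/hadoopfiles/Name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/hadoopfiles/Data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF

    cat > "$conf/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>linux1:9001</value>
  </property>
</configuration>
EOF
}

conf_dir=$(mktemp -d)
write_hadoop_conf "$conf_dir"
ls "$conf_dir"
```

After checking the generated files, copy them into the conf folder on linux1 and then to the other two servers, for example with scp.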

14. Disable the firewall on all 3 servers
service iptables stop
chkconfig iptables off

15. Format the name node
cd /usr/hadoop-1.0.4/bin
./hadoop namenode -format

16. Start Hadoop on the master (linux1)

17. Run "jps" on all 3 servers to check whether Hadoop is running,
or you can open the website below

If any process fails to start, check the log files in the logs folder.
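
Reading jps output on three servers gets tedious, so here is a sketch of a helper that lists which daemons are missing. The function name missing_daemons is my own; the daemon names are the ones a Hadoop 1.x master (NameNode, SecondaryNameNode, JobTracker) and slave (DataNode, TaskTracker) normally run, and the sample output with its process IDs is invented.

```shell
# Sketch: report which expected Hadoop 1.x daemons are missing from jps output.
missing_daemons() {
    role=$1        # "master" or "slave"
    jps_output=$2  # text printed by jps
    case $role in
        master) expected="NameNode SecondaryNameNode JobTracker" ;;
        *)      expected="DataNode TaskTracker" ;;
    esac
    for d in $expected; do
        # Match the daemon name as the last word of a jps line.
        echo "$jps_output" | grep -q " $d\$" || echo "$d"
    done
}

# Example with made-up jps output from a master:
sample="2301 NameNode
2456 JobTracker
2610 Jps"
missing_daemons master "$sample"
```

On the VMs the call would be missing_daemons master "$(jps)" on linux1 and missing_daemons slave "$(jps)" on the other two. In Hadoop 1.0.4 the daemons are started from the master with bin/start-all.sh. If the default ports were not changed, the NameNode status page is at http://linux1:50070 and the JobTracker page at http://linux1:50030.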

It is a good start for learning Hadoop. Even Microsoft is developing data solutions with Hadoop on the Windows platform, so it is time to learn new things.