
Saturday, December 29, 2012

Test Hadoop cluster on VMware

SQL Server MVP Jeremiah Peschka posted two articles about Hadoop, which got me interested in NoSQL.

I don't have much knowledge of NoSQL or Linux, so I am going to set up a test environment on my laptop over the holidays.

1. Download the CentOS Linux installation ISO file
http://www.centos.org/

2. Download the Java JDK 1.6
http://www.oracle.com/technetwork/java/javase/downloads/index.html

3. Download the Hadoop release
http://hadoop.apache.org/#Download+Hadoop

I downloaded release 1.0.4

4. Create VMs with VMware Workstation
I created 3 VMs:
linux1 : 192.168.27.29   -----> master
linux2 : 192.168.27.31   -----> slave
linux3 : 192.168.27.32   -----> slave


5. Install the Linux OS on each VM

6. Configure the VM IP addresses
vi /etc/sysconfig/network-scripts/ifcfg-eth0
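
For reference, a minimal static-IP config for linux1 could look like the sketch below (the NETMASK and GATEWAY values are assumptions for my lab network; adjust them to your VMware NAT/host-only settings), then restart networking with "service network restart":
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.27.29
NETMASK=255.255.255.0
GATEWAY=192.168.27.2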

7. Configure the host name and hosts file
vi /etc/sysconfig/network          ---------> set the hostname
vi /etc/hosts                      ---------> add IP/hostname mappings for all 3 servers, for instance:
192.168.27.29 linux1
192.168.27.31 linux2
192.168.27.32 linux3
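
For example, on linux1 the /etc/sysconfig/network file would contain something like:
NETWORKING=yes
HOSTNAME=linux1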

8. Install JDK
Copy the JDK install file to the VM with VMware shared folders and extract it to a local folder. I installed the JDK in /usr/jdk1.6.0_37
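
For example, assuming the Oracle self-extracting .bin package and a shared folder mounted at /mnt/hgfs/share (both the file name and the mount point are assumptions):
cd /usr
cp /mnt/hgfs/share/jdk-6u37-linux-x64.bin .
chmod +x jdk-6u37-linux-x64.bin
./jdk-6u37-linux-x64.bin          # extracts to /usr/jdk1.6.0_37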

9. Install Hadoop
Copy the install file to the VM with VMware shared folders and extract it to a local folder. I installed the Hadoop files in /usr/hadoop-1.0.4
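
For example, assuming the release tarball sits in the same shared folder (the path is an assumption):
cd /usr
tar -xzf /mnt/hgfs/share/hadoop-1.0.4.tar.gz      # creates /usr/hadoop-1.0.4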

10. Create folders for Hadoop
Temp folder: /usr/hadoop-1.0.4/tmp
Data folder: /usr/hadoopfiles/Data
Name folder: /usr/hadoopfiles/Name

Make sure the folder owner is the user that will start the Hadoop daemons, and the Data and Name folders should have 755 permissions:
chmod 755 /usr/hadoopfiles/Data /usr/hadoopfiles/Name
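
For completeness, creating the folders and setting the owner could look like this, assuming the daemons will run under a user named "hadoop" (the user name is an assumption; use whatever account you start Hadoop with):
mkdir -p /usr/hadoop-1.0.4/tmp /usr/hadoopfiles/Data /usr/hadoopfiles/Name
chown -R hadoop:hadoop /usr/hadoop-1.0.4 /usr/hadoopfiles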

11. Set environment variables
vi /etc/profile

then add the lines below:

HADOOP_HOME=/usr/hadoop-1.0.4
JAVA_HOME=/usr/jdk1.6.0_37
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$CLASSPATH
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export JAVA_HOME
export HADOOP_HOME
export CLASSPATH
export PATH
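
After saving, reload the profile (or log out and back in) so the new variables take effect, and do a quick check:
source /etc/profile
java -version
hadoop version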

12. Set up SSH
1) Generate an SSH public key on all 3 servers
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Run "ssh localhost" to test that SSH works. Make sure the authorized_keys file has the correct permissions; that's important:
chmod 644 authorized_keys

2) Copy the file id_dsa.pub to the other 2 servers under a new file name. For instance,
on linux1, copy id_dsa.pub to linux2 and linux3 as linux1_id_dsa.pub, as shown below.
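
A sketch of the copy, run from linux1 (this assumes the same user account exists on all 3 servers):
scp ~/.ssh/id_dsa.pub linux2:~/.ssh/linux1_id_dsa.pub
scp ~/.ssh/id_dsa.pub linux3:~/.ssh/linux1_id_dsa.pub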

3) Log on to the other 2 servers and import the new key
cat ~/.ssh/linux1_id_dsa.pub >> ~/.ssh/authorized_keys

Do these 3 steps on all 3 servers and make sure you can SSH to any remote server without a password prompt.
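
For example, from linux1 these should print the remote host names without asking for a password:
ssh linux2 hostname
ssh linux3 hostname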


13. Configure Hadoop.
1) Open $HADOOP_HOME/conf/hadoop-env.sh and set the line below
export JAVA_HOME=/usr/jdk1.6.0_37

2) Open $HADOOP_HOME/conf/masters and add the line below
linux1


3) Open $HADOOP_HOME/conf/slaves and add the lines below
linux2
linux3


4) Edit $HADOOP_HOME/conf/core-site.xml

<configuration>
<!--- global properties -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop-1.0.4/tmp</value> <!-- you can configure your own folder path for tmp here -->
<description>A base for other temporary directories.</description>
</property>
<!-- file system properties -->
<property>
<name>fs.default.name</name>
<value>hdfs://linux1:9000</value>
</property>
</configuration>


5) Edit $HADOOP_HOME/conf/hdfs-site.xml


<configuration>
<!--- global properties -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/hadoopfiles/Name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/hadoopfiles/Data</value>
</property>
</configuration>

6) Edit $HADOOP_HOME/conf/mapred-site.xml

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>linux1:9001</value>
</property>
</configuration>

Apply the same configuration on all 3 servers; copying the conf folder is the easiest way, as shown below.
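
A sketch of copying the configuration from linux1 to the slaves (this assumes Hadoop is installed in the same path on every server, as it is here):
scp /usr/hadoop-1.0.4/conf/* linux2:/usr/hadoop-1.0.4/conf/
scp /usr/hadoop-1.0.4/conf/* linux3:/usr/hadoop-1.0.4/conf/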

14. Disable the firewall on all 3 servers
service iptables stop
chkconfig iptables off

15. Format the name node
cd /usr/hadoop-1.0.4/bin
./hadoop namenode -format

16. Start Hadoop on the master (linux1)
./start-all.sh

17. Run "jps" on all 3 servers to check whether Hadoop is running,
or open the web pages below:
http://linux1:50030
http://linux1:50070
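
For reference, on a healthy cluster jps should list roughly these daemons (process IDs omitted):
linux1 (master): NameNode, SecondaryNameNode, JobTracker, Jps
linux2/linux3 (slaves): DataNode, TaskTracker, Jps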

You can check the log files in the logs folder if any process fails to start.

It is a good start to learning Hadoop. Even Microsoft is developing data solutions with Hadoop on the Windows platform, so it is time to learn new things.

reference:
http://blog.csdn.net/skyering/article/details/6457466