Nutch and Hadoop (as user with NFS)
Posted: July 25th, 2007 | Author: Joey | Filed under: Thesis
This is from http://wiki.apache.org/nutch/NutchHadoopTutorial with some modifications to fit my needs. The article above assumes you have root access, which should be the case if you are going to consume the resources needed to crawl the internet. However, I want to run this as a normal user. Another gotcha is that I am working on a cluster that shares users’ home directories over NFS. For more details or explanation, refer to the original article.
First, I need to be able to log in to all the nodes on the cluster through SSH without being prompted for a password.
# Leave the password empty.
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
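Because the home directory is shared over NFS, the new authorized_keys entry is visible on every node at once. A minimal sketch for confirming that each node really accepts the key without prompting follows. The function name check_nodes is hypothetical (it is not part of Nutch or Hadoop), and the node names in the usage line are the example names used later in this post.

```shell
#!/bin/sh
# check_nodes NODE...  (hypothetical helper, not part of Nutch or Hadoop)
# Prints one line per node and returns non-zero if any node refused
# a non-interactive login.
check_nodes() {
    failed=0
    for node in "$@"; do
        # BatchMode=yes: never prompt; fail if key authentication does not work.
        # stdin is redirected so ssh cannot swallow input meant for the caller.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true </dev/null; then
            echo "ok   $node"
        else
            echo "FAIL $node"
            failed=1
        fi
    done
    return "$failed"
}
```

Usage would be something like `check_nodes node00 node01 node02`. A node you have never connected to before may also report FAIL until its host key has been accepted once interactively. If a node fails even though the key is in place, the permissions on ~/.ssh and ~/.ssh/authorized_keys are worth checking, since sshd is commonly configured to ignore key files that other users can write to.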
Next, get Nutch built.
mkdir -p {~/src/nutch,~/opt/nutch/build}
cd ~/src/nutch
svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk .
echo dist.dir=$HOME/opt/nutch/build > build.properties
ant package
I want one copy for crawling and another for searching or just poking around, so that I can experiment without changing the settings of the crawling copy.
cd ~/opt/nutch/
cp -r build crawler
cp -r build sandbox
~/opt/nutch now contains three directories:
- build – Ant builds Nutch into this directory.
- crawler – The instance used for crawling.
- sandbox – The instance used for searching the indices created by the crawls, or whatever I want to play with.
If I need to rebuild Nutch, I can copy the project from the build directory into the other two. Since my home directory is mounted via NFS, there is no need to log into the other nodes of the cluster and repeat this process. It is done.
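One thing to watch when copying a fresh build over the crawler instance is that a plain recursive copy would also overwrite its conf directory. The sketch below is one possible way to refresh an instance while leaving conf, logs and pids alone. The helper name sync_build is hypothetical, and skipping those three directories is my assumption about what counts as per-instance state.

```shell
#!/bin/sh
# sync_build BUILD_DIR TARGET_DIR  (hypothetical helper)
# Copies every top-level entry of BUILD_DIR into TARGET_DIR except the
# directories assumed to hold per-instance state: conf, logs and pids.
sync_build() {
    build_dir=$1
    target_dir=$2
    mkdir -p "$target_dir" || return 1
    for entry in "$build_dir"/*; do
        [ -e "$entry" ] || continue            # empty build directory
        case $(basename "$entry") in
            conf|logs|pids) continue ;;        # keep the instance's own copies
        esac
        cp -R "$entry" "$target_dir"/ || return 1
    done
}
```

After a rebuild this would be called as `sync_build ~/opt/nutch/build ~/opt/nutch/crawler` and again for sandbox. It only adds and overwrites files, so anything deleted from the build stays behind in the target, and new default settings shipped in the build's conf directory have to be merged by hand.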
Now Hadoop needs to be configured. I am only going to go over the configuration of the crawler instance. The sandbox one will not use Hadoop, and is therefore straightforward. I need a place for the logs, the pid files for managing processes, and the filesystem Hadoop uses.
cd ~/opt/nutch/crawler
mkdir {logs,pids}
cd conf
vim hadoop-env.sh
I added/modified hadoop-env.sh in the following manner:
export HOSTNAME=`hostname`
export HADOOP_HOME=/home/jmazzare/opt/nutch/crawler
export JAVA_HOME=/usr/local/java
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs/${HOSTNAME}
export HADOOP_PID_DIR=${HADOOP_HOME}/pids/${HOSTNAME}
export HADOOP_IDENT_STRING=${USER}_${HOSTNAME}
You will, of course, adjust the path to your home directory and your Java home as needed. Now the XML configuration files need to be set up. I suggest you read this article on which files should contain which information: http://wiki.apache.org/lucene-hadoop/HowToConfigure. That said, I threw everything into hadoop-site.xml. To keep these examples short, I am only showing one master node (node00) and two slave nodes (node01, node02).
conf/masters
node00
conf/slaves
node01
node02
conf/hadoop-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>node00:9000</value>
    <description>
      The name of the default file system. Either the literal string
      "local" or a host:port for NDFS.
    </description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>node00:9001</value>
    <description>
      The host and port that the MapReduce job tracker runs at. If
      "local", then jobs are run in-process as a single map and
      reduce task.
    </description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
    <description>
      Define mapred.map.tasks to be the number of slave hosts.
    </description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
    <description>
      Define mapred.reduce.tasks to be the number of slave hosts.
    </description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-filesystem-jmazzare/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-filesystem-jmazzare/data</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/tmp/hadoop-filesystem-jmazzare/mapreduce/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop-filesystem-jmazzare/mapreduce/local</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>
      Define dfs.replication to be the number of slave hosts.
    </description>
  </property>
</configuration>
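The description fields tie mapred.map.tasks, mapred.reduce.tasks and dfs.replication to the number of slave hosts, so these values are easy to forget when conf/slaves changes. The sketch below shows one way to read both sides for comparison. The helper names count_slaves and site_value are hypothetical, and site_value is a text-level match with sed, not a real XML parser, so it assumes each property is written as a plain name element followed by its value element.

```shell
#!/bin/sh
# count_slaves SLAVES_FILE  (hypothetical helper)
# Prints the number of non-blank lines, i.e. the number of slave hosts.
count_slaves() {
    grep -c '[^[:space:]]' "$1"
}

# site_value SITE_FILE PROPERTY_NAME  (hypothetical helper)
# Prints the text of the <value> element that directly follows
# <name>PROPERTY_NAME</name>. Newlines are removed first so the match
# works whether the file is written on one line or many.
site_value() {
    tr -d '\n' < "$1" |
        sed -n "s:.*<name>$2</name>[[:space:]]*<value>\([^<]*\)</value>.*:\1:p"
}
```

From ~/opt/nutch/crawler, `count_slaves conf/slaves` and `site_value conf/hadoop-site.xml dfs.replication` can then be compared by eye or in a small loop over the three property names.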
The filesystem Hadoop uses expects to have its own directories to write to on each host. With my home directory shared over NFS, every node would end up writing to the same directory, so I put the filesystem in /tmp, which is mounted separately on each node. I linked to it for convenience.
cd ~/opt/nutch/crawler
mkdir /tmp/hadoop-filesystem-`whoami`
chmod 700 /tmp/hadoop-filesystem-`whoami`
ln -s /tmp/hadoop-filesystem-`whoami` ./filesystem
bin/hadoop namenode -format
It seems to create the directories for the log files just fine, but not so much for the pids. So…
cd ~/opt/nutch/crawler/conf
for node in $(cat slaves masters); do mkdir ../pids/$node; done
cd ..
bin/start-all.sh
The easiest way I have found to verify that everything went correctly is to run the bin/stop-all.sh command. If it complains that there was nothing to stop, then something isn’t configured correctly. If it claims to have stopped everything, then all is well. If things don’t seem right, make sure you don’t have any processes that have escaped your attention. Kill those.
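To look for processes that have escaped your attention, the pid files can be checked against what is actually running. This is a minimal sketch under the assumption that the daemons leave files ending in .pid in the per-host pid directory configured earlier. The helper name live_pids is hypothetical.

```shell
#!/bin/sh
# live_pids PID_DIR  (hypothetical helper)
# For every *.pid file under PID_DIR, prints "<pid> <file>" if a process
# with that pid is still alive on the current host.
live_pids() {
    find "$1" -type f -name '*.pid' | sort | while IFS= read -r pid_file; do
        pid=$(cat "$pid_file")
        # kill -0 sends no signal; it only tests that the process exists
        # and that we are allowed to signal it.
        if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
            echo "$pid $pid_file"
        fi
    done
}
```

Pids only mean something on the host that wrote them, so this should be run on each node against that node's own directory, for example `live_pids ~/opt/nutch/crawler/pids/$(hostname)`. Output after a stop-all.sh suggests a leftover process, although a pid can in principle have been reused by an unrelated program, so it is worth looking at the process before killing it.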
When you run Nutch now, it will use the filesystem from Hadoop. So any files that Nutch needs to be aware of need to be put into Hadoop’s filesystem. I will show the classic example of crawling the Apache website.
cd ~/opt/nutch/crawler/
mkdir urls
echo 'http://lucene.apache.org' > urls/urllist.txt
This file needs to be put into the Hadoop filesystem.
# Must be running... bin/start-all.sh
cd ~/opt/nutch/crawler
bin/hadoop dfs -put urls urls
# You can verify with:
bin/hadoop dfs -ls
bin/hadoop dfs -cat urls/urllist.txt
More references for configuring Nutch can be found here and here.
In particular, make sure an http.agent.name is set in conf/ and add lucene.apache.org to the “whitelist” in conf/crawl-urlfilter.txt.
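A quick way to sanity-check a whitelist before starting a long crawl is to try a few URLs against it. The sketch below is only a rough emulation of how I understand the filter file to be read: rules are tried from top to bottom, the first matching rule decides, a leading + accepts, a leading - rejects, and a URL that matches nothing is rejected. The helper name url_allowed is hypothetical, and it uses grep -E, whose regex dialect is not identical to the Java regexes Nutch uses, so treat it as an approximation.

```shell
#!/bin/sh
# url_allowed FILTER_FILE URL  (hypothetical helper)
# Returns 0 if the first matching rule in FILTER_FILE starts with '+',
# non-zero if it starts with '-' or if no rule matches at all.
url_allowed() {
    filter_file=$1
    url=$2
    while IFS= read -r rule; do
        case $rule in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        sign=${rule%"${rule#?}"}          # first character: + or -
        regex=${rule#?}                   # the rest is the pattern
        if printf '%s\n' "$url" | grep -Eq -- "$regex"; then
            [ "$sign" = "+" ]
            return                        # status of the test above
        fi
    done < "$filter_file"
    return 1
}
```

With a filter file containing a line such as `+^http://([a-z0-9]*\.)*apache.org/` followed by `-.`, the call `url_allowed conf/crawl-urlfilter.txt http://lucene.apache.org/` should succeed and a URL on any other domain should fail. That pattern line follows the form used in the Nutch tutorial and is an example, not necessarily the exact line used for this crawl.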
Now, finally do the crawl.
# Again, make sure it is all running
cd ~/opt/nutch/crawler
bin/nutch crawl urls -dir crawled -depth 3
You can monitor the output directly, or open a browser and go to port 50030 of the master node: http://node00:50030. The progress of the job is easy to follow from there. Also look at http://node00:50030/machines.jsp and check for failures. If everything is going fine, just wait for it to finish…
After it is done, you can export the data to the sandbox.
cd ~/opt/nutch/crawler
bin/hadoop dfs -copyToLocal crawled ../sandbox/crawl
bin/stop-all.sh
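Before shutting Hadoop down it can be worth confirming that the export is complete, since the data otherwise only lives under /tmp on the nodes. The sketch below checks the local copy for the subdirectories I would expect a finished crawl of this Nutch version to contain (crawldb, linkdb, segments, indexes, index). That list is an assumption on my part, and the helper name missing_crawl_parts is hypothetical.

```shell
#!/bin/sh
# missing_crawl_parts CRAWL_DIR  (hypothetical helper)
# Prints each expected subdirectory that is absent from CRAWL_DIR and
# returns 0 only if none are missing. The list of names is an assumption.
missing_crawl_parts() {
    status=0
    for part in crawldb linkdb segments indexes index; do
        if [ ! -d "$1/$part" ]; then
            echo "$part"
            status=1
        fi
    done
    return "$status"
}
```

For example, `missing_crawl_parts ~/opt/nutch/sandbox/crawl && echo complete` prints "complete" only when all five directories exist. If it lists anything, the crawl may have stopped early or the copy may have been interrupted.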
Now you can do whatever you want with that; point tomcat at it, or query it directly.
cd ~/opt/nutch/sandbox
bin/nutch org.apache.nutch.searcher.NutchBean apache