Installing Nutch on Cygwin: a worked example

1. First, download and install Cygwin from http://www.cygwin.com/.

2. From http://nutch.apache.org/downloads.html, download http://www.apache.org/dyn/closer.cgi/nutch/1.7/apache-nutch-1.7-bin.zip and unzip it.

3. Install the JDK and add it to PATH. It is best to install it under a simple directory such as d:/java; otherwise running Nutch may fail with a "cygpath: can't convert empty path" error.

4. In the \home\Administrator directory under the Cygwin installation directory, open the .bashrc file and add the JAVA_HOME path and LANG.

export JAVA_HOME=D:/Java/jdk1.7.0_25
export LANG='en_US'
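
A quick sanity check of the JAVA_HOME value can catch a bad path before the first crawl. The sketch below is only an illustration: `check_java_home` is a hypothetical helper, and the rule that a path containing spaces is risky is an assumption read into the advice in step 3, not something Nutch or Cygwin documents in these terms.

```shell
# Hypothetical helper: classify a JAVA_HOME value.
# Assumption: an empty value or a path containing spaces is what step 3 tries to avoid.
check_java_home() {
  case "$1" in
    "")    echo "empty" ;;
    *" "*) echo "has-spaces" ;;
    *)     echo "ok" ;;
  esac
}

# Check the value currently exported in this shell (an unset variable counts as empty).
check_java_home "${JAVA_HOME:-}"
```

After editing .bashrc, open a new Cygwin terminal (or run `source ~/.bashrc`) before running the check, so the new value is actually in the environment.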

5. Edit nutch/conf/nutch-site.xml and add the http.agent.name property.

<configuration>
<property>
 <name>http.agent.name</name>
 <value>xq_nutch</value>
</property>
</configuration>
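
To confirm the property was saved as intended, the value can be read back from the file. This is a rough sketch, not a real XML parser: `agent_name` is a hypothetical helper, and it assumes the `<value>` line directly follows the `<name>` line, as in the snippet above.

```shell
# Hypothetical helper: print the <value> on the line right after
# <name>http.agent.name</name> in the given config file.
# Assumption: name and value sit on consecutive lines, as in this post.
agent_name() {
  grep -A1 '<name>http.agent.name</name>' "$1" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Run from the nutch directory; does nothing if the file is not there.
if [ -f conf/nutch-site.xml ]; then
  agent_name conf/nutch-site.xml
fi
```

An empty result suggests the property is missing or laid out differently from what the helper expects.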

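6. In nutch/conf/nutch-default.xml, set the value of http.agent.name to xq_nutch as well. (Values in nutch-site.xml override those in nutch-default.xml, so step 5 alone is normally sufficient.)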

<property>
  <name>http.agent.name</name>
  <value>xq_nutch</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.
 
  NOTE: You should also check other related properties:
 
	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version
 
  and set their values appropriately.
 
  </description>
</property>

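7. Create a urls directory under the nutch directory and create seed.txt in it. Put http://www.163.com/ in seed.txt; note that the URL must end with "/".
seed.txt holds the addresses of the sites to crawl.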
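
From the Cygwin shell, the same step can be done with two commands. This is a minimal sketch that assumes the current directory is the nutch directory; the directory name, file name, and URL are the ones used in this post.

```shell
# Run from the nutch directory.
mkdir -p urls                                        # seed directory
printf '%s\n' 'http://www.163.com/' > urls/seed.txt  # one seed URL per line
cat urls/seed.txt                                    # show what was written
```

Additional sites can be added later by appending one URL per line to the same file.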

8. Edit the nutch/conf/regex-urlfilter.txt file.

# Replace "+." with the rule below so that only 163.com pages are crawled.
#+.
+^http://([a-z0-9]*\.)*163\.com/
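
The accept rule can be tried against a few sample URLs before starting a crawl. The sketch below only approximates it with `grep -E`: Nutch evaluates the rule as a Java regular expression and also applies the other rules in the file in order, so this checks the pattern alone. `accepts` is a hypothetical helper, the sample URLs are made up for illustration, and the dot in 163.com is escaped here so that it matches only a literal dot.

```shell
# Hypothetical helper: succeed if the URL matches the accept pattern from step 8.
# Assumption: grep -E and Java regex behave the same for this simple pattern.
accepts() {
  printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*163\.com/'
}

for url in \
  'http://www.163.com/' \
  'http://news.163.com/index.html' \
  'http://www.example.com/' \
  'https://www.163.com/'
do
  if accepts "$url"; then echo "accept $url"; else echo "reject $url"; fi
done
```

Because the pattern starts with `^http://`, https URLs are rejected by this rule; widening it to `^https?://` would be a separate decision.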

9. In the Cygwin shell, change to the nutch directory and run the following command.

./bin/nutch crawl urls -dir crawl -depth 3 -topN 5

When it finishes, the output directory appears:
crawl
That means everything worked.
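
A slightly stronger check than seeing the directory is to look inside it. In Nutch 1.x a finished crawl normally leaves crawldb, linkdb, and segments sub-directories under the directory given to -dir; the sketch below reports which of them exist. `check_crawl_dir` is a hypothetical helper, not part of Nutch.

```shell
# Hypothetical helper: report which of the usual Nutch 1.x output
# sub-directories exist under the given crawl directory.
check_crawl_dir() {
  for sub in crawldb linkdb segments; do
    if [ -d "$1/$sub" ]; then
      echo "found $sub"
    else
      echo "missing $sub"
    fi
  done
}

# Run from the nutch directory after the crawl command has finished.
check_crawl_dir crawl
```

If all three are present, `./bin/nutch readdb crawl/crawldb -stats` should print summary statistics for the crawl database.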

Errors may occur during installation.
1) For Exception in thread "main" java.io.IOException: Failed to set permissions of path, see the article "cygwin nutch: fixing the Failed to set permissions of path exception".

2) For Exception in thread "main" java.io.IOException: Job failed!, see the article "cygwin nutch: fixing the java.io.IOException: Job failed exception".

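Note: installing Nutch on Linux is simpler; with the configuration above it runs without trouble.
Note: it is best not to install Nutch on Linux inside a virtual machine, since the following error may appear (the trace shows the machine's own hostname, xen-47 here, failing to resolve):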

Exception in thread "main" java.net.UnknownHostException: xen-47: xen-47
        at java.net.InetAddress.getLocalHost(InetAddress.java:1438)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:960)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1340)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:141)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: java.net.UnknownHostException: xen-47
        at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
        at java.net.InetAddress.getLocalHost(InetAddress.java:1434)
        ... 12 more
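
Since the exception comes from InetAddress.getLocalHost, one thing worth checking is whether the machine's hostname has an entry in /etc/hosts. The sketch below is a partial check under that assumption: `hosts_file_has` is a hypothetical helper, and it looks only at the hosts file, not at DNS or other name services.

```shell
# Hypothetical helper: succeed if hosts file $1 has a non-comment
# entry whose name or alias equals host $2.
hosts_file_has() {
  awk -v h="$2" '
    { sub(/#.*/, "") }                                   # strip comments
    { for (i = 2; i <= NF; i++) if ($i == h) found = 1 } # field 1 is the IP
    END { exit !found }
  ' "$1"
}

name=$(uname -n)   # node name of this machine, e.g. xen-47 in the trace above
if [ -r /etc/hosts ] && hosts_file_has /etc/hosts "$name"; then
  echo "$name is listed in /etc/hosts"
else
  echo "$name is not listed in /etc/hosts"
fi
```

If the name is not listed, adding a line that maps the machine's IP address to that hostname is the usual remedy for this kind of UnknownHostException.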

Reference: http://wiki.apache.org/nutch/NutchTutorial

Permalink: http://www.chepoo.com/cygwin-nutch-installations.html | IT技术精华网
