Importing the Nutch 1.7 Source Code into Eclipse

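Continuing from the previous article, "cygwin nutch installation example", this post covers importing the Nutch 1.7 source code into Eclipse, building it, and running Crawl.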

1. Install the Subclipse plugin (SVN client)
Update site: http://subclipse.tigris.org/update_1.8.x

2. Install the IvyDE plugin (downloads the dependency JARs)
Update site: http://www.apache.org/dist/ant/ivyde/updatesite/

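3. Check out the code from SVN
File > New > Project > SVN > Checkout Projects from SVN
Create a new repository location > URL: https://svn.apache.org/repos/asf/nutch/tags/ > select the URL > select release-1.7 and click Next.
Select "Check out as a project in the workspace" > Finish.

4. Configure the build path
Right-click the project and choose "Configure" > "Convert to Faceted Form".
Under "Project Facets", select "Java" with version "1.6", then click "OK".

5. On the Libraries tab, click "Add Class Folder" on the right and select Nutch's conf directory.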
6. Run Ant on build.xml.

7. Edit nutch/conf/nutch-site.xml and add the http.agent.name property.
See: cygwin nutch installation example
8. In nutch/conf/nutch-default.xml, set the value of http.agent.name to xq_nutch.
See: cygwin nutch installation example
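
For reference, here is a minimal sketch of what nutch/conf/nutch-site.xml might look like after step 7. It assumes the same agent name, xq_nutch, that step 8 uses, and that the file holds no other properties yet; any properties you already have would sit alongside this one inside the <configuration> element.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Name the crawler reports in its HTTP User-Agent header.
       xq_nutch is the value from step 8; substitute your own. -->
  <property>
    <name>http.agent.name</name>
    <value>xq_nutch</value>
  </property>
</configuration>
```
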
9. In nutch-default.xml, change the value of plugin.folders:

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

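10. In the project root, create a folder named urls containing a file named seed.txt, and list the URLs of the pages to crawl in seed.txt.

11. Edit the regex-urlfilter.txt file.
See: cygwin nutch installation example

12. Run Ant on build.xml again.

13. Run the crawl (the main class is org.apache.nutch.crawl.Crawl, as the stack trace below shows).

The following exception may occur during the run: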

java.lang.Exception: java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
	... 11 more
Caused by: java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
	at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
	... 16 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
	... 19 more
Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
	at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:123)
	at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:74)
	... 24 more
2013-09-05 20:40:49,329 INFO  mapred.JobClient (JobClient.java:monitorAndPrintJob(1393)) -  map 0% reduce 0%
2013-09-05 20:40:49,332 INFO  mapred.JobClient (JobClient.java:monitorAndPrintJob(1448)) - Job complete: job_local1315110785_0001
2013-09-05 20:40:49,332 INFO  mapred.JobClient (Counters.java:log(585)) - Counters: 0
2013-09-05 20:40:49,333 INFO  mapred.JobClient (JobClient.java:runJob(1356)) - Job Failed: NA
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
	at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:132)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

This is caused by a misconfigured plugin.folders property. Changing it to the following configuration fixes the problem.

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
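
As an alternative that the original steps do not use: Nutch loads nutch-site.xml after nutch-default.xml, so the same override could instead be placed in nutch-site.xml, leaving nutch-default.xml untouched. A minimal sketch, assuming the project is run from its root directory so that the relative path ./src/plugin resolves:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the plugin.folders default from nutch-default.xml.
       ./src/plugin is the same value used in the fix above. -->
  <property>
    <name>plugin.folders</name>
    <value>./src/plugin</value>
  </property>
</configuration>
```

If nutch-site.xml already contains other properties, such as http.agent.name, only the <property> element needs to be added inside the existing <configuration> element.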

Permalink: http://www.chepoo.com/importing-nutch-1-7-source-code-eclipse.html | IT技术精华网