
Heritrix 3 Development Example


In the previous article (installing heritrix-3.1.1 on Windows), we covered getting Heritrix installed on Windows. Quite a bit of work remains before it is ready for real development.

1. Creating your own DecideRule to filter URLs

The code is as follows:

package com.xq;

import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.DecideResult;
import org.archive.modules.deciderules.DecideRule;

public class JobsRule extends DecideRule {
    private static final long serialVersionUID = 1L;

    @Override
    protected DecideResult innerDecide(CrawlURI uri) {
        String u = uri.getURI();
        // Always accept DNS lookups and robots.txt fetches: they are
        // prerequisites for crawling any page, and they will never match
        // the news URL prefix below.
        if (u.startsWith("dns") || u.startsWith("DNS")
                || u.endsWith("robots.txt")) {
            return DecideResult.ACCEPT;
        }
        // Only crawl NetEase news pages (and their images) under
        // http://news.163.com/13/0922/10/
        if ((u.endsWith(".html") || u.endsWith(".gif")
                || u.endsWith(".jpg") || u.endsWith(".jpeg"))
                && u.contains("http://news.163.com/13/0922/10/")) {
            return DecideResult.ACCEPT;
        }
        return DecideResult.REJECT;
    }
}
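The accept/reject logic can be exercised outside Heritrix. Below is a minimal standalone sketch of the same predicate; the class and method names are hypothetical, and it assumes DNS and robots.txt prerequisites should always be accepted:

```java
// Standalone replica of the JobsRule decision logic, for quick checking
// without a running crawler. Names here are illustrative only.
public class JobsRuleCheck {
    static boolean accepts(String u) {
        // Crawl prerequisites are always accepted.
        if (u.startsWith("dns") || u.startsWith("DNS")
                || u.endsWith("robots.txt")) {
            return true;
        }
        // Pages and images under the target news prefix are accepted.
        return (u.endsWith(".html") || u.endsWith(".gif")
                || u.endsWith(".jpg") || u.endsWith(".jpeg"))
                && u.contains("http://news.163.com/13/0922/10/");
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://news.163.com/13/0922/10/X.html")); // true
        System.out.println(accepts("http://news.163.com/13/0921/10/X.html")); // false
        System.out.println(accepts("dns:news.163.com"));                      // true
    }
}
```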

2. Configuring crawler-beans.cxml

2.1 Find the bean with id="scope" and add the JobsRule created above to it.

 <!-- SCOPE: rules for which discovered URIs to crawl; order is very 
      important because last decision returned other than 'NONE' wins. -->
 <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- <property name="logToFile" value="false" /> -->
  <property name="rules">
   <list>
    <!-- Begin by REJECTing all... -->
    <bean class="org.archive.modules.deciderules.RejectDecideRule">
    </bean>
    <!-- ...then ACCEPT those within configured/seed-implied SURT prefixes... -->
    <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
     <!-- <property name="seedsAsSurtPrefixes" value="true" /> -->
     <!-- <property name="alsoCheckVia" value="false" /> -->
     <!-- <property name="surtsSourceFile" value="" /> -->
     <!-- <property name="surtsDumpFile" value="${launchId}/surts.dump" /> -->
     <!-- <property name="surtsSource">
           <bean class="org.archive.spring.ConfigString">
            <property name="value">
             <value>
              # example.com
              # http://www.example.edu/path1/
              # +http://(org,example,
             </value>
            </property> 
           </bean>
          </property> -->
          <property name="surtsSource">
           <bean class="org.archive.spring.ConfigString">
            <property name="value">
             <value>
             +^http://news.163.com/
             </value>
            </property> 
           </bean>
          </property>
    </bean>
    <!-- ...but REJECT those more than a configured link-hop-count from start... -->
    <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
     <!-- <property name="maxHops" value="20" /> -->
    </bean>
    <!-- ...but ACCEPT those more than a configured link-hop-count from start... -->
    <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
     <!-- <property name="maxTransHops" value="2" /> -->
     <!-- <property name="maxSpeculativeHops" value="1" /> -->
    </bean>
    <bean class="com.xq.JobsRule"></bean>
    <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->
    <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
          <property name="decision" value="REJECT"/>
          <property name="seedsAsSurtPrefixes" value="false"/>
          <property name="surtsDumpFile" value="${launchId}/negative-surts.dump" /> 
     <!-- <property name="surtsSource">
           <bean class="org.archive.spring.ConfigFile">
            <property name="path" value="negative-surts.txt" />
           </bean>
          </property> -->
    </bean>
    <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
    <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
          <property name="decision" value="REJECT"/>
     <!-- <property name="listLogicalOr" value="true" /> -->
     <!-- <property name="regexList">
           <list>
           </list>
          </property> -->
    </bean>
    <!-- ...and REJECT those with suspicious repeating path-segments... -->
    <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
     <!-- <property name="maxRepetitions" value="2" /> -->
    </bean>
    <!-- ...and REJECT those with more than threshold number of path-segments... -->
    <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
     <!-- <property name="maxPathDepth" value="20" /> -->
    </bean>
    <!-- ...but always ACCEPT those marked as prerequisite for another URI... -->
    <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
    </bean>
    <!-- ...but always REJECT those with unsupported URI schemes -->
    <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
    </bean>
   </list>
  </property>
 </bean>

2.2 Find the bean with id="warcWriter" and change its class to "org.archive.modules.writer.MirrorWriterProcessor".

Heritrix 3 defaults to org.archive.modules.writer.WARCWriterProcessor, which writes the crawled content as warc.gz files. Switching to MirrorWriterProcessor brings back the familiar mirror layout from Heritrix 1.

<bean id="warcWriter" class="org.archive.modules.writer.MirrorWriterProcessor">

3. Parsing the pages with jsoup to extract the content we want

When the crawl finishes, a mirror directory appears under the heritrix-3 directory; it contains the crawled pages. Next we parse a page with jsoup. Example:

package com.xq;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

public class JsoupTest {
    public static void main(String[] args) {
        localParse();
    }

    public static void localParse() {
        try {
            // Parse a page from the local mirror; NetEase pages are gb2312-encoded.
            File input = new File("D:/hq/workspace/heritrix-3.1.0-src/mirror/news.163.com/13/0922/10/99CB9PQU0001124J.html");
            Document doc = Jsoup.parse(input, "gb2312");
            // Article title
            System.out.println(doc.getElementById("h1title").text());
            // Article body
            String body = doc.getElementById("endText").outerHtml();
            System.out.println(body);
            // Clean the body HTML against each of jsoup's built-in whitelists
            System.out.println("\n\n-----------------------------------");
            System.out.println(Jsoup.clean(body, Whitelist.basic()));
            System.out.println("\n\n-----------------------------------");
            System.out.println(Jsoup.clean(body, Whitelist.basicWithImages()));
            System.out.println("\n\n-----------------------------------");
            System.out.println(Jsoup.clean(body, Whitelist.none()));
            System.out.println("\n\n-----------------------------------");
            System.out.println(Jsoup.clean(body, Whitelist.relaxed()));
            System.out.println("\n\n-----------------------------------");
            System.out.println(Jsoup.clean(body, Whitelist.simpleText()));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void neteaseParse() {
        try {
            // Fetch and parse a live page instead of a local mirror file
            Document doc = Jsoup.connect("http://news.163.com/13/0912/18/98JFMO4R00014JB6.html").get();
            System.out.println(doc.getElementById("h1title").text());
            String body = doc.getElementById("endText").outerHtml();
            System.out.println(body);
            System.out.println("\n\n-----------------------------------");
            System.out.println(Jsoup.clean(body, Whitelist.basic()));
            System.out.println("\n\n-----------------------------------");
            System.out.println(Jsoup.clean(body, Whitelist.basicWithImages()));
            System.out.println("\n\n-----------------------------------");
            System.out.println(Jsoup.clean(body, Whitelist.none()));
            System.out.println("\n\n-----------------------------------");
            System.out.println(Jsoup.clean(body, Whitelist.relaxed()));
            System.out.println("\n\n-----------------------------------");
            System.out.println(Jsoup.clean(body, Whitelist.simpleText()));

            //Elements eles = doc.select("div.artHead");
            //System.out.println(eles.first().select("h3[class=artTitle]"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
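In practice you would parse every page in the mirror, not just one hard-coded file. A minimal sketch of collecting the .html files under the mirror directory, each of which could then be fed to a localParse()-style extractor; the class name is hypothetical:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Recursively collect every .html file under a directory, e.g. the
// mirror directory produced by MirrorWriterProcessor.
public class MirrorWalker {
    public static List<File> htmlFiles(File dir) {
        List<File> out = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children == null) {
            return out; // not a directory, or unreadable
        }
        for (File f : children) {
            if (f.isDirectory()) {
                out.addAll(htmlFiles(f));
            } else if (f.getName().endsWith(".html")) {
                out.add(f);
            }
        }
        return out;
    }
}
```

Each collected File can then be handed to Jsoup.parse(file, "gb2312") as shown above.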

Once jsoup has extracted the content we need, it can be stored in a database and indexed. The above is only a simple demo of Heritrix 3 development; a real application needs many further improvements.

Permalink: http://www.chepoo.com/heritrix-3-development-examples.html | IT技术精华网
