
Analyzing Lucene 4 Query Score Calculation


The Main Test Program

We will use TermQuery as an example to walk through how Lucene 4 computes the score. Note that a TermQuery does not use the coord(q,d) factor. First, let's look at the main test program:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneScoreTest {

	public static void main(String[] args) throws Exception {
		test();
	}

	public static void test() throws Exception {
		Directory dir = new RAMDirectory();
		IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_45,
				new StandardAnalyzer(Version.LUCENE_45));
		IndexWriter writer = new IndexWriter(dir, config);

		Document doc1 = new Document();
		Document doc2 = new Document();
		Document doc3 = new Document();

		doc1.add(new TextField("bookname", "bc bc", Store.YES));
		doc2.add(new TextField("bookname", "ab bc", Store.YES));
		doc3.add(new TextField("bookname", "ab bc cd", Store.YES));

		writer.addDocument(doc1);
		writer.addDocument(doc2);
		writer.addDocument(doc3);
		writer.close();

		IndexReader reader = DirectoryReader.open(dir);
		IndexSearcher searcher = new IndexSearcher(reader);

		TermQuery q = new TermQuery(new Term("bookname", "bc"));

		TopDocs topdocs = searcher.search(q, 5);
		ScoreDoc[] scoreDocs = topdocs.scoreDocs;

		for (int i = 0; i < scoreDocs.length; i++) {
			int doc = scoreDocs[i].doc;
			Document document = searcher.doc(doc);
			System.out.println("bookname====" + document.get("bookname"));
			System.out.println(searcher.explain(q, doc));
			System.out.println(scoreDocs[i].score);
		}
		reader.close();
	}
}

searcher.search(q, 5) calls IndexSearcher's search(Query query, int n) method.

  /** Finds the top <code>n</code>
   * hits for <code>query</code>.
   *
   * @throws BooleanQuery.TooManyClauses If a query would exceed 
   *         {@link BooleanQuery#getMaxClauseCount()} clauses.
   */
  public TopDocs search(Query query, int n)
    throws IOException {
    return search(query, null, n);
  }

This continues into IndexSearcher's search(Query query, Filter filter, int n) method.

   /** Finds the top <code>n</code>
   * hits for <code>query</code>, applying <code>filter</code> if non-null.
   *
   * @throws BooleanQuery.TooManyClauses If a query would exceed 
   *         {@link BooleanQuery#getMaxClauseCount()} clauses.
   */
  public TopDocs search(Query query, Filter filter, int n)
    throws IOException {
    return search(createNormalizedWeight(wrapFilter(query, filter)), null, n);
  }

Note that within this method, createNormalizedWeight(wrapFilter(query, filter)) creates the Weight and normalizes it.

  /**
   * Creates a normalized weight for a top-level {@link Query}.
   * The query is rewritten by this method and {@link Query#createWeight} called,
   * afterwards the {@link Weight} is normalized. The returned {@code Weight}
   * can then directly be used to get a {@link Scorer}.
   * @lucene.internal
   */
  public Weight createNormalizedWeight(Query query) throws IOException {
    query = rewrite(query);
    Weight weight = query.createWeight(this);
    float v = weight.getValueForNormalization();
    float norm = getSimilarity().queryNorm(v);
    if (Float.isInfinite(norm) || Float.isNaN(norm)) {
      norm = 1.0f;
    }
    weight.normalize(norm, 1.0f);
    return weight;
  }

Analysis of getValueForNormalization

Inside createNormalizedWeight(Query query), weight.getValueForNormalization() amounts to calling the getValueForNormalization() method of TermWeight:

    public float getValueForNormalization() {
      return stats.getValueForNormalization();
    }

The stats used in getValueForNormalization() is obtained from TFIDFSimilarity's computeWeight() method, whose code is as follows:

    public final SimWeight computeWeight(float queryBoost, CollectionStatistics collectionStats, TermStatistics... termStats) {
      final Explanation idf = termStats.length == 1
          ? idfExplain(collectionStats, termStats[0])
          : idfExplain(collectionStats, termStats);
      return new IDFStats(collectionStats.field(), idf, queryBoost);
    }

In the code above, new IDFStats(collectionStats.field(), idf, queryBoost) initializes queryWeight and queryBoost. The code is:

    public IDFStats(String field, Explanation idf, float queryBoost) {
      // TODO: Validate?
      this.field = field;
      this.idf = idf;
      this.queryBoost = queryBoost;
      this.queryWeight = idf.getValue() * queryBoost; // compute query weight
    }

The getValueForNormalization() method of the IDFStats class is:

   public float getValueForNormalization() {
      // TODO: (sorta LUCENE-1907) make non-static class and expose this squaring via a nice method to subclasses?
      return queryWeight * queryWeight;  // sum of squared weights
    }

From the analysis above, stats is TFIDFSimilarity's IDFStats class, so weight.getValueForNormalization() inside createNormalizedWeight(Query query) amounts to calling IDFStats' getValueForNormalization() method. In other words, it returns the square of the product of the current term's idf and the query boost:

   public float getValueForNormalization() {
      // TODO: (sorta LUCENE-1907) make non-static class and expose this squaring via a nice method to subclasses?
      return (idf.getValue() * queryBoost) * (idf.getValue() * queryBoost);  // sum of squared weights
    }

Analysis of queryNorm

In createNormalizedWeight(Query query), getSimilarity().queryNorm(v) is the query normalization factor, which makes scores from different queries comparable. This factor does not affect document ranking, because all matching documents are multiplied by the same factor. Concretely, it executes DefaultSimilarity's queryNorm() method, where the sumOfSquaredWeights parameter is the getValueForNormalization() value:

   public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
  }
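Plugging in the single-term value from above (illustrative arithmetic only), the square root simply undoes the squaring:

    float queryNorm = (float) (1.0 / Math.sqrt(0.71231794f * 0.71231794f)); // 1.4038675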

Normalizing the Query Weight

In createNormalizedWeight(Query query), weight.normalize(norm, 1.0f) amounts to calling the normalize() method of TFIDFSimilarity's IDFStats class:

   public void normalize(float queryNorm, float topLevelBoost) {
      this.queryNorm = queryNorm * topLevelBoost;
      queryWeight *= this.queryNorm;              // normalize query weight
      value = queryWeight * idf.getValue();         // idf for document
    }

Here queryNorm is the value analyzed above, and topLevelBoost is 1.0f. The final value is:

   value = (idf.getValue() * queryBoost)
           * (float) (1.0 / Math.sqrt((idf.getValue() * queryBoost) * (idf.getValue() * queryBoost)))
           * idf.getValue();
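Notice that for a single positive term weight the normalization cancels the boost, so value reduces to idf.getValue(). The following self-contained sketch, plain arithmetic mirroring the steps above rather than actual Lucene source, traces the whole pipeline:

public class NormalizationSketch {
	public static void main(String[] args) {
		float idf = 0.71231794f;                        // idf(docFreq=3, maxDocs=3)
		float queryBoost = 1.0f;                        // default boost
		float queryWeight = idf * queryBoost;           // IDFStats constructor
		float v = queryWeight * queryWeight;            // getValueForNormalization()
		float queryNorm = (float) (1.0 / Math.sqrt(v)); // DefaultSimilarity.queryNorm(v)
		queryWeight *= queryNorm;                       // IDFStats.normalize(), first line
		float value = queryWeight * idf;                // IDFStats.normalize(), second line
		System.out.println(value);                      // 0.71231794, i.e. idf (up to float rounding)
	}
}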

Analyzing the Score Calculation

Executing IndexSearcher's search(Query query, Filter filter, int n) method enters search(Weight weight, ScoreDoc after, int nDocs), which then reaches line 448 of IndexSearcher, search(leafContexts, weight, after, nDocs):

  /** Expert: Low-level search implementation.  Finds the top <code>n</code>
   * hits for <code>query</code>.
   *
   * <p>Applications should usually call {@link IndexSearcher#search(Query,int)} or
   * {@link IndexSearcher#search(Query,Filter,int)} instead.
   * @throws BooleanQuery.TooManyClauses If a query would exceed 
   *         {@link BooleanQuery#getMaxClauseCount()} clauses.
   */
  protected TopDocs search(List<AtomicReaderContext> leaves, Weight weight, ScoreDoc after, int nDocs) throws IOException {
    // single thread
    int limit = reader.maxDoc();
    if (limit == 0) {
      limit = 1;
    }
    nDocs = Math.min(nDocs, limit);
    TopScoreDocCollector collector = TopScoreDocCollector.create(nDocs, after, !weight.scoresDocsOutOfOrder());
    search(leaves, weight, collector);
    return collector.topDocs();
  }

Line 491 of IndexSearcher enters search(leaves, weight, collector):

 /**
   * Lower-level search API.
   * 
   * <p>
   * {@link Collector#collect(int)} is called for every document. <br>
   * 
   * <p>
   * NOTE: this method executes the searches on all given leaves exclusively.
   * To search across all the searchers leaves use {@link #leafContexts}.
   * 
   * @param leaves 
   *          the searchers leaves to execute the searches on
   * @param weight
   *          to match documents
   * @param collector
   *          to receive hits
   * @throws BooleanQuery.TooManyClauses If a query would exceed 
   *         {@link BooleanQuery#getMaxClauseCount()} clauses.
   */
  protected void search(List<AtomicReaderContext> leaves, Weight weight, Collector collector)
      throws IOException {
 
    // TODO: should we make this
    // threaded...?  the Collector could be sync'd?
    // always use single thread:
    for (AtomicReaderContext ctx : leaves) { // search each subreader
      try {
        collector.setNextReader(ctx);
      } catch (CollectionTerminatedException e) {
        // there is no doc of interest in this reader context
        // continue with the following leaf
        continue;
      }
      Scorer scorer = weight.scorer(ctx, !collector.acceptsDocsOutOfOrder(), true, ctx.reader().getLiveDocs());
      if (scorer != null) {
        try {
          scorer.score(collector);
        } catch (CollectionTerminatedException e) {
          // collection was terminated prematurely
          // continue with the following leaf
        }
      }
    }
  }

Line 627 of IndexSearcher enters scorer.score(collector), which begins the score computation. The scorer here is a TermScorer, whose score() method is:

  public float score() throws IOException {
    assert docID() != NO_MORE_DOCS;
    return docScorer.score(docsEnum.docID(), docsEnum.freq());  
  }

At line 71 of TermScorer, docScorer.score() calls the score() method of TFIDFSimScorer in TFIDFSimilarity:

    public float score(int doc, float freq) {
      final float raw = tf(freq) * weightValue; // compute tf(f)*weight

      return norms == null ? raw : raw * decodeNormValue(norms.get(doc));  // normalize for field
    }
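In DefaultSimilarity, tf(freq) is simply the square root of the term frequency; this is where the 1.4142135 in the explain output below comes from:

    // DefaultSimilarity.tf: sqrt(freq). "bc" occurs twice in doc 0 ("bc bc"):
    float tf = (float) Math.sqrt(2.0f); // 1.4142135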

decodeNormValue(norms.get(doc)) decodes the norm that was computed and stored at index time, as analyzed in the previous article on how Lucene 4 computes norms while building the index: the term's field norm is saved during indexing and decoded here.
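As an illustration of the precision loss involved (a sketch assuming DefaultSimilarity's single-byte SmallFloat norm encoding): doc 0 ("bc bc") contains two terms, so its length norm is 1/sqrt(2), roughly 0.7071, but the lossy byte encoding decodes back to 0.625, which is exactly the fieldNorm(doc=0) in the explain output below.

import org.apache.lucene.util.SmallFloat;

// Sketch: how doc 0's norm becomes 0.625. Doc 0 ("bc bc") has 2 terms,
// and DefaultSimilarity stores each norm as a single byte via SmallFloat.
public class NormDecodeSketch {
	public static void main(String[] args) {
		float lengthNorm = (float) (1.0 / Math.sqrt(2.0));    // 0.70710678
		byte encoded = SmallFloat.floatToByte315(lengthNorm); // what gets stored in the index
		float decoded = SmallFloat.byte315ToFloat(encoded);   // 0.625 (precision lost)
		System.out.println(decoded);
	}
}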

The weightValue here is the stats.value analyzed above.

At this point we have traced the entire score computation through the code.

Analyzing the explain Output

Below is the result of running the explain function:

0.629606 = (MATCH) weight(bookname:bc in 0) [DefaultSimilarity], result of:
  0.629606 = fieldWeight in 0, product of:
    1.4142135 = tf(freq=2.0), with freq of:
      2.0 = termFreq=2.0
    0.71231794 = idf(docFreq=3, maxDocs=3)
    0.625 = fieldNorm(doc=0)

For the term bc in the bookname field:

tf: tf(freq=2.0) = 1.4142135

idf: idf(docFreq=3, maxDocs=3) = 0.71231794 (all three documents contain bc)

initial queryWeight: idf.getValue() * queryBoost = 0.71231794

sumOfSquaredWeights: (idf.getValue() * queryBoost) * (idf.getValue() * queryBoost) = 0.71231794 * 0.71231794 = 0.50739685

queryNorm: (float) (1.0 / Math.sqrt(sumOfSquaredWeights)) = (float) (1.0 / Math.sqrt(0.71231794 * 0.71231794)) = 1.4038675

weightValue: queryNorm * topLevelBoost * initial queryWeight * idf = 1.4038675 * 1.0 * 0.71231794 * 0.71231794 = 0.71231794

fieldNorm: fieldNorm(doc=0) = 0.625

score: tf(freq) * weightValue * fieldNorm = 1.4142135 * 0.71231794 * 0.625 = 0.629606

This matches the score in the explain output.
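As a final sanity check, the whole computation can be reproduced in a few lines of plain arithmetic (illustrative only, reusing the numbers derived above):

public class ScoreCheck {
	public static void main(String[] args) {
		float tf = (float) Math.sqrt(2.0);                      // tf(freq=2.0) = 1.4142135
		float idf = (float) (Math.log(3 / 4.0) + 1.0);          // idf(docFreq=3, maxDocs=3) = 0.71231794
		float queryNorm = (float) (1.0 / Math.sqrt(idf * idf)); // queryBoost = 1.0f
		float weightValue = queryNorm * 1.0f * idf * idf;       // = idf
		float fieldNorm = 0.625f;                               // decoded norm for doc 0
		System.out.println(tf * weightValue * fieldNorm);       // prints approximately 0.629606
	}
}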
