Count matching terms in Lucene

Question

I am searching a field with Lucene 3.5 and would like to find out how many words of my query match that field. For example, my field is "JavaServer Faces (JSF) is a Java-based Web application framework intended to simplify development integration of web-based user interfaces." and my query is "java/jsf/framework/doesnotexist". I want the result 3, since only "java", "jsf" and "framework" are present in the field. Here is the simple example I am following:

 public void explain(String document, String queryExpr) throws Exception {

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, document);
        w.close();
        String queryExpression = queryExpr;
        Query q = new QueryParser(Version.LUCENE_35, "title", analyzer).parse(queryExpression);

        System.out.println("Query: " + queryExpression);
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs topDocs = searcher.search(q, 10);
        // Iterate over the returned hits only: totalHits can exceed the 10 requested.
        for (int i = 0; i < topDocs.scoreDocs.length; i++) {
            ScoreDoc match = topDocs.scoreDocs[i];
            System.out.println("match.score: " + match.score);
            Explanation explanation = searcher.explain(q, match.doc); //#1
            System.out.println("----------");
            Document doc = searcher.doc(match.doc);
            System.out.println(doc.get("title"));
            System.out.println(explanation.toString());
        }
        searcher.close();
        reader.close(); // a searcher built from a reader does not close that reader
    }

The output with the parameters mentioned above is:

 0.021505041 = (MATCH) product of:
  0.028673388 = (MATCH) sum of:
    0.0064956956 = (MATCH) weight(title:java in 0), product of:
      0.2709602 = queryWeight(title:java), product of:
        0.30685282 = idf(docFreq=1, maxDocs=1)
        0.8830299 = queryNorm

....

     0.033902764 = (MATCH) fieldWeight(title:framework in 0), product of:
        1.4142135 = tf(termFreq(title:framework)=2)
        0.30685282 = idf(docFreq=1, maxDocs=1)
        0.078125 = fieldNorm(field=title, doc=0)
  0.75 = coord(3/4)

I would like to get this 3/4.

Thanks!

Recommended answer

You can achieve this by overriding Lucene's DefaultSimilarity with the following method definitions:

  • computeNorm(field, state) -> state.getBoost()
  • tf(freq) -> freq == 0 ? 0 : 1
  • idf(docFreq, numDocs) -> 1
  • coord(overlap, maxOverlap) -> 1 / maxOverlap
  • queryNorm(sumOfSquaredWeights) -> 1

This way, every matching term contributes exactly 1 to the sum, so the final score of a document ends up being the coord factor (1 / maxOverlap) times the number of matching terms.
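Since the score is matches / maxOverlap, the count can be recovered by multiplying back. The following is a minimal sketch of that step, not part of Lucene: `MatchCount` and `fromScore` are hypothetical names, and it assumes a flat BooleanQuery of unboosted SHOULD term clauses, so that maxOverlap equals the number of clauses.

```java
// Hypothetical helper (not a Lucene class): inverts score = matches / maxOverlap.
final class MatchCount {
    private MatchCount() {}

    /**
     * @param score      score produced under the customized similarity
     * @param maxOverlap number of optional clauses in the query (assumed flat and unboosted)
     * @return the number of query terms that matched the document
     */
    static int fromScore(float score, int maxOverlap) {
        if (maxOverlap <= 0) {
            throw new IllegalArgumentException("maxOverlap must be positive");
        }
        // Round instead of truncating: values such as 2 * (1f / 3) are not exact in float.
        return Math.round(score * maxOverlap);
    }

    public static void main(String[] args) {
        int clauses = 4;                   // java, jsf, framework, doesnotexist
        float score = 3 * (1f / clauses);  // three of the four terms matched
        System.out.println(fromScore(score, clauses) + "/" + clauses); // prints 3/4
    }
}
```

In a search loop, this would presumably be called with the hit's score and the query's clause count, for example `q.clauses().size()` on a Lucene 3.x BooleanQuery.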

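import java.util.Arrays;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

Directory dir = new RAMDirectory();

// Similarity that reduces scoring to "coord * number of matching terms".
Similarity similarity = new DefaultSimilarity() {
  @Override
  public float computeNorm(String fld, FieldInvertState state) {
    // Ignore field length; keep only the index-time boost.
    return state.getBoost();
  }

  @Override
  public float coord(int overlap, int maxOverlap) {
    // Constant factor, independent of how many clauses matched.
    return 1f / maxOverlap;
  }

  @Override
  public float idf(int docFreq, int numDocs) {
    return 1f;
  }

  @Override
  public float queryNorm(float sumOfSquaredWeights) {
    return 1f;
  }

  @Override
  public float tf(float freq) {
    // A term counts once, however often it occurs in the field.
    return freq == 0f ? 0f : 1f;
  }
};

// The same similarity is set at index time (norms) and at search time (scoring).
IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_35,
    new WhitespaceAnalyzer(Version.LUCENE_35));
iwConf.setSimilarity(similarity);
IndexWriter iw = new IndexWriter(dir, iwConf);
Document doc = new Document();
Field field = new Field("text", "", Store.YES, Index.ANALYZED);
doc.add(field);
// Reuse one Document/Field instance and index four small documents.
for (String value : Arrays.asList("a b c", "c d", "a b d", "a c d")) {
  field.setValue(value);
  iw.addDocument(doc);
}
iw.commit();
iw.close();

IndexReader ir = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(ir);
searcher.setSimilarity(similarity);
// Three optional clauses, so maxOverlap is 3.
BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("text", "a")), Occur.SHOULD);
q.add(new TermQuery(new Term("text", "b")), Occur.SHOULD);
q.add(new TermQuery(new Term("text", "d")), Occur.SHOULD);

TopDocs topDocs = searcher.search(q, 100);
System.out.println(topDocs.totalHits + " results");
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
for (int i = 0; i < scoreDocs.length; ++i) {
  int docId = scoreDocs[i].doc;
  float score = scoreDocs[i].score; // expected to be (matching terms) / 3
  System.out.println(ir.document(docId).get("text") + " -> " + score);
  System.out.println(searcher.explain(q, docId));
}
searcher.close();
ir.close();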
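If changing the similarity is not an option, the `coord(3/4)` entry shown in the question's output could in principle be read back from the Explanation text. The sketch below is only an illustration of that idea: `CoordParser` is a hypothetical name, it depends on the textual format printed above (which is not a stable API), and it returns only the first coord entry, so nested boolean queries would need more care.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Lucene class): extracts "overlap/maxOverlap"
// from explanation text that contains a fragment such as "coord(3/4)".
final class CoordParser {
    private static final Pattern COORD = Pattern.compile("coord\\((\\d+)/(\\d+)\\)");

    private CoordParser() {}

    /** Returns {overlap, maxOverlap}, or null when the text has no coord entry. */
    static int[] parse(String explanationText) {
        Matcher m = COORD.matcher(explanationText);
        if (!m.find()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    public static void main(String[] args) {
        // Shortened copy of the output quoted in the question.
        String text = "0.021505041 = (MATCH) product of:\n  0.75 = coord(3/4)\n";
        int[] coord = parse(text);
        System.out.println(coord[0] + " of " + coord[1] + " terms matched"); // prints 3 of 4 terms matched
    }
}
```

Parsing display text is fragile compared with the similarity-based approach, so this is best treated as a debugging aid rather than a replacement for it.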
