Lucene custom scoring for numeric fields

Question

In addition to standard term search with tf-idf similarity over a text content field, I would like scoring based on the "similarity" of numeric fields. This similarity would depend on the distance between the value in the query and the value in the document (e.g. a Gaussian with m = [user input], s = 0.5).

I.e., let's say documents represent people, and each person document has two fields:

  • description (full text)
  • age (numeric)

I want to find documents with a query like

description:(x y z) age:30

but with age acting not as a filter but as part of the score (for a person aged 30 the multiplier would be 1.0, for a 25-year-old 0.8, etc.).
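To make the multiplier idea concrete, here is a minimal plain-Java sketch (no Lucene involved) of an unnormalized Gaussian weight. The class name `GaussianAgeWeight`, the method `weight`, and the spread `sigma = 7.5` are assumptions chosen for illustration; that spread happens to give roughly 0.8 at a distance of five years, matching the example above, but it is not a value from the question.

```java
public class GaussianAgeWeight {

    /**
     * Unnormalized Gaussian: returns 1.0 when value == peak and
     * decays toward 0 as the distance from the peak grows.
     */
    static double weight(double value, double peak, double sigma) {
        double d = value - peak;
        return Math.exp(-(d * d) / (2.0 * sigma * sigma));
    }

    public static void main(String[] args) {
        double peak = 30.0;  // age supplied with the query
        double sigma = 7.5;  // assumed spread in years; tune for your data
        for (int age : new int[] {20, 25, 30, 35}) {
            System.out.printf("age %d -> multiplier %.3f%n",
                    age, weight(age, peak, sigma));
        }
    }
}
```

The final score would then be the text score multiplied by this weight, so a smaller `sigma` penalizes age differences more sharply.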

Can this be achieved in a sensible manner?

Finally I found out that this can be done by wrapping a ValueSourceQuery and a TermQuery with a CustomScoreQuery. See my solution below.

EDIT 2: With Lucene versions changing fast, I just want to add that this was tested on Lucene 3.0 (Java).

Recommended answer

Okay, so here's a (somewhat verbose) proof of concept as a full JUnit test. I haven't tested its efficiency on a large index yet, but from what I've read it should perform well after a warm-up, provided there's enough RAM available to cache the numeric fields.

  package tests;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.NumericField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.search.function.CustomScoreQuery;
  import org.apache.lucene.search.function.IntFieldSource;
  import org.apache.lucene.search.function.ValueSourceQuery;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.util.Version;

  import junit.framework.TestCase;

  public class AgeAndContentScoreQueryTest extends TestCase
  {
     public class AgeAndContentScoreQuery extends CustomScoreQuery
     {
        protected float peakX;
        protected float sigma;

        public AgeAndContentScoreQuery(Query subQuery, ValueSourceQuery valSrcQuery, float peakX, float sigma) {
           super(subQuery, valSrcQuery);
           this.setStrict(true); // do not normalize score values from ValueSourceQuery!
           this.peakX = peakX;   // age for which the age-relevance is best
           this.sigma = sigma;
        }

        @Override
        public float customScore(int doc, float subQueryScore, float valSrcScore){
           // subQueryScore is the tf-idf score from the content query
           float contentScore = subQueryScore;

           // valSrcScore is the value of the date-of-birth field, represented as a float;
           // convert it to an age, then to a Gaussian-like age relevance score
           float age = 2011 - valSrcScore; // reference year is hard-coded for this test
           float d = age - peakX;
           // 2*sigma*sigma must be parenthesized to form the Gaussian denominator
           float ageScore = (float) Math.exp(-(d * d) / (2 * sigma * sigma));

           float finalScore = ageScore * contentScore;

           System.out.println("#contentScore: " + contentScore);
           System.out.println("#dobValue:     " + (int)valSrcScore);
           System.out.println("#ageValue:     " + (int)age);
           System.out.println("#ageScore:     " + ageScore);
           System.out.println("#finalScore:   " + finalScore);
           System.out.println("+++++++++++++++++");

           return finalScore;
        }
     }

     protected Directory directory;
     protected Analyzer analyzer = new WhitespaceAnalyzer();
     protected String fieldNameContent = "content";
     protected String fieldNameDOB = "dob";

     protected void setUp() throws Exception
     {
        directory = new RAMDirectory();
        analyzer = new WhitespaceAnalyzer();

        // indexed documents
        String[] contents = {"foo baz1", "foo baz2 baz3", "baz4"};
        int[] dobs = {1991, 1981, 1987}; // year of birth

        IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < contents.length; i++) 
        {
           Document doc = new Document();
           doc.add(new Field(fieldNameContent, contents[i], Field.Store.YES, Field.Index.ANALYZED)); // store & index
           doc.add(new NumericField(fieldNameDOB, Field.Store.YES, true).setIntValue(dobs[i]));      // store & index
           writer.addDocument(doc);
        }
        writer.close();
     }

     public void testSearch() throws Exception
     {
        String inputTextQuery = "foo bar";
        float peak = 27.0f;   // age with the best age relevance
        float sigma = 10.0f;  // spread in years; with the parenthesized denominator a value
                              // like 0.1 would drive every non-peak age score to zero

        QueryParser parser = new QueryParser(Version.LUCENE_30, fieldNameContent, analyzer);
        Query contentQuery = parser.parse(inputTextQuery);

        ValueSourceQuery dobQuery = new ValueSourceQuery( new IntFieldSource(fieldNameDOB) );
         // or: FieldScoreQuery dobQuery = new FieldScoreQuery(fieldNameDOB,Type.INT);

        CustomScoreQuery finalQuery = new AgeAndContentScoreQuery(contentQuery, dobQuery, peak, sigma);

        IndexSearcher searcher = new IndexSearcher(directory);
        TopDocs docs = searcher.search(finalQuery, 10);

        System.out.println("\nDocuments found:\n");
        for(ScoreDoc match : docs.scoreDocs)
        {
           Document d = searcher.doc(match.doc);
           System.out.println("CONTENT: " + d.get(fieldNameContent) );
           System.out.println("D.O.B.:  " + d.get(fieldNameDOB) );
           System.out.println("SCORE:   " + match.score );
           System.out.println("-----------------");
        }
     }
  }
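
One detail worth double-checking in any Gaussian written in Java: `/` and `*` have the same precedence and associate left to right, so an exponent written as `-(d*d) / 2*sigma*sigma` divides by 2 and then multiplies by sigma squared. The standalone sketch below (class and method names are hypothetical, not part of the test above) contrasts that form with the parenthesized denominator.

```java
public class GaussianPrecedence {

    /** Evaluated as ((-(d*d) / 2) * sigma) * sigma: sigma ends up in the numerator. */
    static double leftToRight(double d, double sigma) {
        return Math.exp(-(d * d) / 2 * sigma * sigma);
    }

    /** Standard Gaussian exponent: -(d*d) / (2 * sigma^2). */
    static double parenthesized(double d, double sigma) {
        return Math.exp(-(d * d) / (2 * sigma * sigma));
    }

    public static void main(String[] args) {
        double d = 3.0; // distance from the peak, e.g. three years
        for (double sigma : new double[] {0.5, 1.0, 10.0}) {
            System.out.println("sigma=" + sigma
                    + " leftToRight=" + leftToRight(d, sigma)
                    + " parenthesized=" + parenthesized(d, sigma));
        }
    }
}
```

The two forms agree only when sigma is 1; for any other value, a larger sigma narrows the curve in the first form and widens it in the second.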
