Lucene: give more weight the closer a term is to the beginning of the title

Question

I understand how to boost fields either at index time or query time. However, how could I increase the score of matching a term closer to the beginning of a title?

Example:

Query = "lucene"

Doc1 title = "Lucene: Homepage"
Doc2 title = "I have a question about lucene?"

I would like the first document to score higher since "lucene" is closer to the beginning (ignoring term freq for now).

I see how to use the SpanQuery for specifying the proximity between terms, but I'm not sure how to use the information about the position in the field.

I am using Lucene 4.1 in Java.

Recommended answer

I would use a SpanFirstQuery, which matches terms near the beginning of a field. Like all span queries, it relies on term positions, which Lucene indexes by default.

Let's test it on its own first: you only have to provide a SpanTermQuery and the end position by which the match must finish (1 in my example, so only the first term of the field can match).

SpanTermQuery spanTermQuery = new SpanTermQuery(new Term("title", "lucene"));
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(spanTermQuery, 1);

Given your two documents, this query finds only the first one, titled "Lucene: Homepage", provided the title was analyzed with the StandardAnalyzer (which lowercases "Lucene" to the indexed term "lucene").
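
To make the position rule concrete, here is a minimal plain-Java sketch of the check that SpanFirstQuery performs for a single term. It is not Lucene code: the class and method names are hypothetical, and the tokenizer (lowercase, split on anything that is not a letter or digit) is only a rough stand-in for the StandardAnalyzer. The assumption is that a single-term span at position p covers [p, p + 1), so it matches when p + 1 <= end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class, not part of Lucene.
class SpanFirstSketch {

    // Rough stand-in for an analyzer: lowercase, then split on non-letter/non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if `term` sits at a position p with p + 1 <= end,
    // i.e. the single-term span [p, p + 1) ends no later than `end`.
    static boolean matchesFirst(String title, String term, int end) {
        List<String> tokens = tokenize(title);
        for (int p = 0; p < tokens.size() && p + 1 <= end; p++) {
            if (tokens.get(p).equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesFirst("Lucene: Homepage", "lucene", 1));                // true
        System.out.println(matchesFirst("I have a question about lucene?", "lucene", 1)); // false
        System.out.println(matchesFirst("I have a question about lucene?", "lucene", 6)); // true
    }
}
```

With end = 1 only the very first token can match, which is why the second title is excluded; raising end to 6 would admit it, because "lucene" is the sixth token there.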

Now we can combine the SpanFirstQuery above with a normal text query so that the span query only influences the score. An easy way to do this is a BooleanQuery with the text query as a MUST clause and the span query as a SHOULD clause, like this:

Term term = new Term("title", "lucene");
// MUST clause: decides which documents match at all.
TermQuery termQuery = new TermQuery(term);
// SHOULD clause: optional, it only adds to the score when "lucene" is the first term.
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
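
The query above gives a single bonus to titles that start with the term. One possible extension, which is not part of the original answer, is to add several SpanFirstQuery SHOULD clauses with increasing end values, so that an earlier match satisfies more optional clauses and tends to score higher. The plain-Java sketch below only models that counting idea; the class name, the cut-offs in ENDS, and the simple tokenizer are all assumptions for illustration, and real Lucene scores also depend on the similarity in use.

```java
import java.util.Locale;

// Hypothetical illustration class, not part of Lucene.
class TieredBoostSketch {

    // Assumed cut-offs; each would correspond to one SpanFirstQuery(term, end) SHOULD clause.
    static final int[] ENDS = {1, 3, 6};

    // Zero-based position of the first occurrence of `term`, or -1 if it is absent.
    // Tokenization is a rough stand-in for an analyzer.
    static int firstPosition(String title, String term) {
        int pos = 0;
        for (String t : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (t.isEmpty()) {
                continue;
            }
            if (t.equals(term)) {
                return pos;
            }
            pos++;
        }
        return -1;
    }

    // Number of optional clauses the title would satisfy: one per cut-off with pos + 1 <= end.
    static int satisfiedClauses(String title, String term) {
        int pos = firstPosition(title, term);
        if (pos < 0) {
            return 0;
        }
        int count = 0;
        for (int end : ENDS) {
            if (pos + 1 <= end) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(satisfiedClauses("Lucene: Homepage", "lucene"));                // 3
        System.out.println(satisfiedClauses("A Lucene question", "lucene"));               // 2
        System.out.println(satisfiedClauses("I have a question about lucene?", "lucene")); // 1
    }
}
```

The more clauses a title satisfies, the earlier the term appears, which approximates "closer to the beginning scores higher" in steps rather than continuously.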

There are probably other ways to achieve the same result, such as a CustomScoreQuery or custom scoring code, but this seems to me the easiest one.

The code I used to test this prints the following output (scores included), running the TermQuery alone first, then the SpanFirstQuery alone, and finally the combined BooleanQuery:

------ TermQuery --------
Total hits: 2
title: I have a question about lucene - score: 0.26010898
title: Lucene: I have a really hard question about it - score: 0.22295055
------ SpanFirstQuery --------
Total hits: 1
title: Lucene: I have a really hard question about it - score: 0.15764984
------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------
Total hits: 2
title: Lucene: I have a really hard question about it - score: 0.26912516
title: I have a question about lucene - score: 0.09196242

Here is the complete code:

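import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// The class name is illustrative; the original snippet showed only the methods.
public class SpanFirstQueryDemo {

    public static void main(String[] args) throws Exception {
        Directory directory = FSDirectory.open(new File("data"));

        index(directory);

        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        Term term = new Term("title", "lucene");

        System.out.println("------ TermQuery --------");
        TermQuery termQuery = new TermQuery(term);
        search(indexSearcher, termQuery);

        System.out.println("------ SpanFirstQuery --------");
        SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
        search(indexSearcher, spanFirstQuery);

        System.out.println("------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------");
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
        booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
        search(indexSearcher, booleanQuery);

        indexReader.close();
    }

    // Indexes the two test titles with positions enabled, which span queries require.
    private static void index(Directory directory) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));

        IndexWriter writer = new IndexWriter(directory, config);

        FieldType titleFieldType = new FieldType();
        titleFieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        titleFieldType.setIndexed(true);
        titleFieldType.setStored(true);

        Document document = new Document();
        document.add(new Field("title", "I have a question about lucene", titleFieldType));
        writer.addDocument(document);

        document = new Document();
        document.add(new Field("title", "Lucene: I have a really hard question about it", titleFieldType));
        writer.addDocument(document);

        writer.close();
    }

    // Runs the query and prints each hit's stored fields together with its score.
    private static void search(IndexSearcher indexSearcher, Query query) throws Exception {
        TopDocs topDocs = indexSearcher.search(query, 10);

        System.out.println("Total hits: " + topDocs.totalHits);

        for (ScoreDoc hit : topDocs.scoreDocs) {
            Document result = indexSearcher.doc(hit.doc);
            for (IndexableField field : result) {
                System.out.println(field.name() + ": " + field.stringValue() + " - score: " + hit.score);
            }
        }
    }
}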
