Position offset for Phrase queries in Lucene

Problem description

I am working on customizing the Highlighter plugin (using the FastVectorHighlighter, FVH) to output the position offsets of query terms for a given search. So far I have been able to extract the offset information for normal queries using the code below. However, for phrase queries the code returns the position offsets of all the query terms (i.e. the termSet), even for terms that are not part of the matched phrase. Is there a way in Lucene to get the offset information of only the matched phrase for phrase queries using the FVH?

// In DefaultSolrHighlighter.java::doHighlightingByFastVectorHighlighter()

SolrIndexSearcher searcher = req.getSearcher();

// Term vectors (with positions) stored for this document
TermFreqVector[] tvector = searcher.getReader().getTermFreqVectors(docId);
TermPositionVector tvposition = (TermPositionVector) tvector[0];

// All query terms that hit in this field, as reported by the FVH
Set<String> termSet = highlighter.getHitTermSet(fieldQuery, fieldName);

List<String> hitOffsetPositions = new ArrayList<String>();

for (String term : termSet)
{
    int index = tvposition.indexOf(term);
    int[] positions = tvposition.getTermPositions(index);

    // Join this term's positions into a comma-separated string
    StringBuilder sb = new StringBuilder();
    for (int pos : positions)
    {
        if (sb.length() > 0)
            sb.append(',');
        sb.append(pos);
    }
    hitOffsetPositions.add(sb.toString());
}

if (snippets != null && snippets.length > 0)
{
    docSummaries.add(fieldName, snippets);
    docSummaries.add("hitOffsetPositions", hitOffsetPositions);
}


// In FastVectorHighlighter.java
// Wrapper to expose the query terms the FVH collected for a field
public Set<String> getHitTermSet(FieldQuery fieldQuery, String fieldName)
{
    return fieldQuery.getTermSet(fieldName);
}

Current output:

<lst name="6H500F0">
  <arr name="name">
  <str> New <em>hard drive</em> 500 GB SATA-300 and old drive 200 GB</str>
</arr>
<arr name="hitOffsetPositions">
    <str>2</str>
    <str>3</str>
    <str>10</str>
</arr>

Expected output:

<lst name="6H500F0">
  <arr name="name">
  <str> New <em>hard drive</em> 500 GB SATA-300 and old drive 200 GB</str>
</arr>
<arr name="hitOffsetPositions">
    <str>2</str>
    <str>3</str>
</arr>

The field that I am trying to highlight has termVectors="true", termPositions="true" and termOffsets="true", and I am using Lucene 3.1.0.
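For reference, those Solr schema flags correspond to indexing the field with term vectors, positions and offsets enabled at the Lucene level. A minimal sketch using the Lucene 3.x Field API (the field name, text and helper method are illustrative, not from the question):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch only: index a field so that its term vector carries both
// positions and offsets, which the FVH and TermPositionVector code
// above depend on. Field name and text are placeholders.
static Document makeDoc(String text) {
    Document doc = new Document();
    doc.add(new Field("name", text,
            Field.Store.YES,
            Field.Index.ANALYZED,
            Field.TermVector.WITH_POSITIONS_OFFSETS));
    return doc;
}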

Answer

I wasn't able to get the FVH to handle phrase queries correctly, and wound up having to develop my own summarizer. The gist of my approach is discussed here. What I wound up doing is creating an array of objects, one for each term occurrence that I pulled from the queries. Each object contains a word index and its position, and whether it was already used in some match. These are the TermAtPosition instances in the sample below. Then, given a position span and an array of word identities (indexes) corresponding to a phrase query, I iterated through the array, looking to match all term indexes within the given span. If I found a match, I marked each matching term as consumed and added the matching span to a list of matches. I could then use these matches to score sentences. Here is the matching code:

protected void scorePassage(TermPositionVector v, String[] words, int span,
                            float score, SentenceScore[] scores, Scorer scorer) {
    // All occurrences of the query words, ordered by position
    TermAtPosition[] order = getTermsInOrder(v, words);
    if (order.length < words.length)
        return;

    int positions[] = new int[words.length];
    List<int[]> matches = new ArrayList<int[]>();
    for (int t = 0; t < order.length; t++) {
        TermAtPosition tap = order[t];
        if (tap.consumed)
            continue;

        // Try to anchor a phrase match at this occurrence
        int p = 0;
        positions[p++] = tap.position;
        for (int u = 0; u < words.length; u++) {
            if (u == tap.termIndex)
                continue;
            int nextTermPos = spanContains(order, u, tap.position, span);
            if (nextTermPos == -1)
                break;
            positions[p++] = nextTermPos;
        }
        // Got all terms within the span: record the match
        if (p == words.length)
            matches.add(recordMatch(order, positions.clone()));
    }

    if (matches.size() > 0) {
        for (SentenceScore sentenceScore : scores) {
            for (int[] matchingPositions : matches)
                scorer.scorePassage(sentenceScore, matchingPositions, score);
        }
    }
}


// Returns the position of targetWord if an unconsumed occurrence lies
// within (start, start + span], or -1 if none does.
protected int spanContains(TermAtPosition[] order, int targetWord,
                           int start, int span) {
    for (int i = 0; i < order.length; i++) {
        TermAtPosition tap = order[i];
        if (tap.consumed || tap.position <= start || tap.position > start + span)
            continue;
        if (tap.termIndex == targetWord)
            return tap.position;
    }
    return -1;
}
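The code above relies on a TermAtPosition holder and a getTermsInOrder helper that the answer does not show. Here is a minimal sketch of what they might look like, assuming each entry records the query-word index, its token position from the term vector, and a consumed flag; the names and details beyond what the answer describes are assumptions:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.index.TermPositionVector;

// Hypothetical sketch of the helpers used above; the original answer
// does not show them, so the details here are assumptions.
static class TermAtPosition {
    final int termIndex;   // index into the words[] array of query terms
    final int position;    // token position reported by the term vector
    boolean consumed;      // set once this occurrence is used by a match

    TermAtPosition(int termIndex, int position) {
        this.termIndex = termIndex;
        this.position = position;
    }
}

// Collect every occurrence of every query word and sort by position.
protected TermAtPosition[] getTermsInOrder(TermPositionVector v, String[] words) {
    List<TermAtPosition> result = new ArrayList<TermAtPosition>();
    for (int w = 0; w < words.length; w++) {
        int idx = v.indexOf(words[w]);
        if (idx < 0)
            continue;                  // query word absent from this document
        for (int pos : v.getTermPositions(idx))
            result.add(new TermAtPosition(w, pos));
    }
    TermAtPosition[] order = result.toArray(new TermAtPosition[result.size()]);
    Arrays.sort(order, new Comparator<TermAtPosition>() {
        public int compare(TermAtPosition a, TermAtPosition b) {
            return a.position - b.position;
        }
    });
    return order;
}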

This approach seems to work, but it is greedy. Given the sequence "a a b c", it will match the first a (leaving the second a alone) and then match b and c. I think a bit of recursion or integer programming could be applied to make it less greedy, but I couldn't be bothered, and wanted a faster rather than a more accurate algorithm anyway.
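As a side note, the words[] and span arguments that scorePassage() expects can be derived from the original PhraseQuery. A hedged sketch using the Lucene 3.x API (the helper names are illustrative, not part of the answer):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Sketch: pull the phrase's terms and a plausible span out of a PhraseQuery.
static String[] phraseWords(PhraseQuery pq) {
    Term[] terms = pq.getTerms();
    String[] words = new String[terms.length];
    for (int i = 0; i < terms.length; i++)
        words[i] = terms[i].text();
    return words;
}

static int phraseSpan(PhraseQuery pq) {
    // One reasonable choice of span: the phrase length plus any slop.
    return pq.getTerms().length + pq.getSlop();
}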
