Sentence aware search with Lucene SpanQueries

Question

Is it possible to use a Lucene SpanQuery to find all occurrences where the terms "red", "green" and "blue" all appear within a single sentence?

My first (incomplete/incorrect) approach is to write an analyzer that places a special sentence marker token at the beginning of each sentence, in the same position as the first word of the sentence, and then to query for something similar to the following:

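// Imports assume Lucene 8.x or earlier; from 9.0 the span queries live in org.apache.lucene.queries.spans.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.*;

// FIELD is an assumed name for the indexed text field.
SpanQuery termsInSentence = new SpanNearQuery(
  new SpanQuery[] {
    new SpanTermQuery(new Term(FIELD, MY_SPECIAL_SENTENCE_TOKEN)),
    new SpanTermQuery(new Term(FIELD, "red")),
    new SpanTermQuery(new Term(FIELD, "green")),
    new SpanTermQuery(new Term(FIELD, "blue")),
  },
  Integer.MAX_VALUE, // slop: effectively unlimited (999999999999 does not fit in an int)
  false              // inOrder: terms may appear in any order
);

// Matches every sentence marker, not only the marker of the following sentence.
SpanQuery nextSentence = new SpanTermQuery(new Term(FIELD, MY_SPECIAL_SENTENCE_TOKEN));

SpanNotQuery notInNextSentence = new SpanNotQuery(termsInSentence, nextSentence);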

The problem, of course, is that nextSentence isn't really the next sentence; it's any sentence marker, including the one in the sentence that termsInSentence matches. Every match therefore contains an excluded marker, so this won't work.

My next approach is to create an analyzer that places the token before the sentence (that is, before the first word rather than in the same position as the first word). The problem with this is that I then have to account for the extra position offset caused by MY_SPECIAL_SENTENCE_TOKEN. What's more, this will be particularly bad at first, while I'm using a naive pattern to split sentences (e.g. split on /\.\s+[A-Z0-9]/), because I'll have to account for all of the (false) sentence markers when I search for U. S. S. Enterprise.
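To make the false-marker problem concrete, here is a minimal plain-Java sketch of the naive rule. It assumes the pattern is rewritten with lookarounds so that the capital letter is not consumed by the split; the class name NaiveSentenceSplit and the sample text are illustrative only and do not come from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class NaiveSentenceSplit {
    // Lookaround form of the naive rule /\.\s+[A-Z0-9]/: break on whitespace
    // that follows a period and precedes a capital letter or a digit.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=\\.)\\s+(?=[A-Z0-9])");

    // Returns the pieces the naive rule would treat as separate sentences.
    static List<String> split(String text) {
        return Arrays.asList(BOUNDARY.split(text));
    }

    public static void main(String[] args) {
        // Each printed line is one "sentence" according to the naive rule.
        for (String piece : split("The U. S. S. Enterprise left dock. It returned later.")) {
            System.out.println(piece);
        }
    }
}
```

On this input the rule reports five pieces rather than two sentences, because each abbreviated "U." and "S." looks like a sentence end. An analyzer built on the same rule would insert a marker token at each of those false boundaries.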

So... how should I approach this?

Recommended answer

I would index each sentence as its own Lucene document, including a field that records which source document the sentence came from. Depending on your source material, the overhead of one Lucene document per sentence may be acceptable.
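The shape of that idea can be sketched in plain Java, without Lucene on the classpath. This is only an illustration under assumptions: the class SentenceUnits, its Sentence holder and the containsAll check are hypothetical names, and java.text.BreakIterator stands in for whatever sentence splitter you actually use. In a real index, each Sentence would presumably become one Lucene Document with a stored source-id field and an analyzed text field, and the all-terms check would become a BooleanQuery with three MUST clauses.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SentenceUnits {
    // One sentence plus the id of the source document it came from (hypothetical holder).
    static final class Sentence {
        final String sourceId;
        final int sentenceNo;
        final String text;

        Sentence(String sourceId, int sentenceNo, String text) {
            this.sourceId = sourceId;
            this.sentenceNo = sentenceNo;
            this.text = text;
        }
    }

    // Splits a source document into per-sentence units, each tagged with sourceId.
    static List<Sentence> split(String sourceId, String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<Sentence> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(new Sentence(sourceId, out.size(), sentence));
            }
        }
        return out;
    }

    // True if every (lower-case) term occurs as a word in this one sentence.
    static boolean containsAll(Sentence s, String... terms) {
        Set<String> words = new HashSet<>(
                Arrays.asList(s.text.toLowerCase(Locale.ROOT).split("\\W+")));
        return words.containsAll(Arrays.asList(terms));
    }
}
```

Because each unit holds exactly one sentence, "all three terms in the same sentence" reduces to "all three terms in the same unit", and no marker tokens or position offsets are needed. The source id carried on each unit is what lets you map a hit back to the original document.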
