Sentence aware search with Lucene SpanQueries

Views: 88 Published: 2020/5/4 7:37:34 Tags: search lucene sentence


Problem description

Is it possible to use a Lucene SpanQuery to find all occurrences where the terms "red", "green", and "blue" all appear within a single sentence?

My first (incomplete/incorrect) approach is to write an analyzer that places a special sentence marker token at the beginning of each sentence, in the same position as the sentence's first word, and then to query for something like the following:

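// Package for Lucene 8 and earlier; in Lucene 9+ the span classes live in org.apache.lucene.queries.spans.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// FIELD is a placeholder for the indexed field name, which the question does not give.
SpanQuery termsInSentence = new SpanNearQuery(
  new SpanQuery[] {
    new SpanTermQuery(new Term(FIELD, MY_SPECIAL_SENTENCE_TOKEN)),
    new SpanTermQuery(new Term(FIELD, "red")),
    new SpanTermQuery(new Term(FIELD, "green")),
    new SpanTermQuery(new Term(FIELD, "blue")),
  },
  999999999, // slop is an int, so the value must fit in 32 bits
  false      // the terms may appear in any order
);

// Matches every sentence marker, not only the marker of the following sentence.
SpanQuery nextSentence = new SpanTermQuery(new Term(FIELD, MY_SPECIAL_SENTENCE_TOKEN));

SpanNotQuery notInNextSentence = new SpanNotQuery(termsInSentence, nextSentence);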

The problem, of course, is that nextSentence isn't really the next sentence's marker; it matches any sentence marker, including the one in the sentence that termsInSentence matches. Every termsInSentence match contains a marker, so every match overlaps an excluded span and is thrown away. Therefore this won't work.
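To make the failure concrete, here is a minimal plain-Java model of the overlap rule (it is not Lucene code). It assumes that SpanNotQuery drops any include span that overlaps an exclude span, and it uses a hypothetical document with sentence markers at positions 0 and 5 and "red", "green", "blue" at positions 1, 2 and 3. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of span exclusion; none of this is the Lucene API.
class SpanNotSketch {
    // A span covers token positions [start, end).
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        boolean overlaps(Span other) {
            return start < other.end && other.start < end;
        }
    }

    // Keep only the include spans that overlap no exclude span.
    static List<Span> spanNot(List<Span> include, List<Span> exclude) {
        List<Span> kept = new ArrayList<>();
        for (Span in : include) {
            boolean hit = false;
            for (Span ex : exclude) {
                if (in.overlaps(ex)) {
                    hit = true;
                    break;
                }
            }
            if (!hit) {
                kept.add(in);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical positions: markers at 0 and 5.
        List<Span> markers = List.of(new Span(0, 1), new Span(5, 6));
        // The near match has to contain a marker, so it covers positions 0..3.
        List<Span> termsInSentence = List.of(new Span(0, 4));
        System.out.println(spanNot(termsInSentence, markers).size());
    }
}
```

In this model the only candidate match is discarded, because it contains the very marker that the exclusion is looking for. A span that covered just positions 1..3 would survive, which suggests the marker would have to stay out of the include query for the exclusion to be useful.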

My next approach is to create an analyzer that places the marker token before the sentence (that is, at its own position before the first word rather than in the same position as the first word). The problem with this is that I then have to account for the extra position offset caused by each MY_SPECIAL_SENTENCE_TOKEN. What's more, this will be particularly bad at first, while I'm using a naive pattern to split sentences (e.g. splitting on /\.\s+[A-Z0-9]/), because I'll have to account for all of the (false) sentence markers when I search for U. S. S. Enterprise.
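As a rough illustration of the false-marker problem, the sketch below applies the naive pattern from the question to a made-up sample text using only java.util.regex. The class name, method name and sample sentence are assumptions for the example; no Lucene analyzer is involved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveSentenceSplitSketch {
    // The naive boundary pattern quoted in the question.
    static final Pattern BOUNDARY = Pattern.compile("\\.\\s+[A-Z0-9]");

    // Count how many sentence boundaries the pattern reports in the text.
    static int countBoundaries(String text) {
        Matcher m = BOUNDARY.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One real boundary (after "port."), but the abbreviation adds three more.
        String text = "The U. S. S. Enterprise left port. It returned.";
        System.out.println(countBoundaries(text)); // prints 4
    }
}
```

If each reported boundary inserted a marker token at its own position, the three false boundaries inside "U. S. S. Enterprise" would put marker tokens between the words of the phrase, so a phrase or span query for it would have to allow for them.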

So... how should I approach this?
