How to match against subsets of a search string in SOLR/Lucene

Question

I've got an unusual situation. Normally, when you search a text index, you are searching for a small number of keywords against documents with a larger number of terms.

For example, you might search for "quick brown" and expect to match "the quick brown fox jumps over the lazy dog".

In my case, the document store holds lots of small phrases, and I want to match them against a larger query phrase.

For example, if I have a query:

  • "the quick brown fox jumps over the lazy dog"

and the documents

  • "quick brown"
  • "fox over"
  • "lazy dog"

I'd like to find the documents whose phrase occurs in the query. In this case, that means "quick brown" and "lazy dog", but not "fox over": although its tokens match, it is not a phrase in the search string.
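
To pin down the requirement, here is a minimal brute-force sketch in Python. It is not how Solr/Lucene works; it only spells out the expected result. The function name `occurs_as_phrase` and the lowercase, whitespace-based tokenization are assumptions made for illustration.

```python
def occurs_as_phrase(doc_phrase, query):
    """Return True if the document's tokens appear contiguously, in order, in the query."""
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    doc_tokens = doc_phrase.lower().split()
    query_tokens = query.lower().split()
    n = len(doc_tokens)
    if n == 0:
        return False
    # Slide a window of the document's length across the query tokens.
    return any(
        query_tokens[i:i + n] == doc_tokens
        for i in range(len(query_tokens) - n + 1)
    )


QUERY = "the quick brown fox jumps over the lazy dog"
DOCUMENTS = ["quick brown", "fox over", "lazy dog"]

matches = [doc for doc in DOCUMENTS if occurs_as_phrase(doc, QUERY)]
print(matches)  # ['quick brown', 'lazy dog']
```

This scans every document for every query, so it only serves as a reference for what an index-based approach should return.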

Is this sort of query possible with SOLR/Lucene?

Recommended answer

It sounds like you want to use ShingleFilter in your analysis chain so that you index word bigrams: add ShingleFilterFactory at both index time and query time.

At index time, your documents are then indexed like this:

  • "quick brown" -> quick_brown
  • "fox over" -> fox_over
  • "lazy dog" -> lazy_dog

At query time, your query becomes:

  • "the quick brown fox jumps over the lazy dog" -> "the_quick quick_brown brown_fox fox_jumps jumps_over over_the the_lazy lazy_dog"
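
The bigram expansion above can be mimicked with a few lines of Python. This is only a sketch of the idea, not Lucene's ShingleFilter: the function name `word_bigrams` is hypothetical, tokenization is a plain whitespace split, and the underscore separator follows the notation used in this answer (the real filter's separator and unigram output are configurable, which this sketch ignores).

```python
from typing import List


def word_bigrams(text: str, separator: str = "_") -> List[str]:
    """Join each pair of adjacent tokens into one term (a two-word shingle)."""
    # Assumption: lowercase + whitespace split stands in for the real tokenizer.
    tokens = text.lower().split()
    return [separator.join(pair) for pair in zip(tokens, tokens[1:])]


# Index-time view: each short document becomes a single bigram term.
for doc in ["quick brown", "fox over", "lazy dog"]:
    print(doc, "->", word_bigrams(doc))

# Query-time view: the long query expands into eight bigram terms.
query_terms = word_bigrams("the quick brown fox jumps over the lazy dog")
print(query_terms)
```

Note that "fox over" becomes `fox_over`, which never appears among the query's terms, because "fox" and "over" are not adjacent in the query.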

This is still not enough: by default, the query parser will turn these tokens into a phrase query. So, in your query analyzer only, add PositionFilterFactory after the ShingleFilterFactory. This "flattens" the positions in the query so that the query parser treats the output as synonyms, which yields a BooleanQuery with the following clauses (all SHOULD clauses, so it is basically an OR query):

BooleanQuery:

  • the_quick OR
  • quick_brown OR
  • brown_fox OR
  • ...

This should be the most performant approach, because the query is then really just a BooleanQuery of TermQuery clauses.
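
To see why an OR over bigram terms gives the desired matches, here is a small self-contained Python simulation. It is a sketch under stated assumptions, not Solr code: a dictionary plays the role of the inverted index, and the helper names `word_bigrams`, `build_index`, and `search` are hypothetical.

```python
from collections import defaultdict


def word_bigrams(text):
    """Two-word shingles, joined with an underscore as in the answer's notation."""
    tokens = text.lower().split()
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]


def build_index(documents):
    """Map each bigram term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in word_bigrams(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """OR together one term lookup per query bigram (all clauses optional)."""
    hits = set()
    for term in word_bigrams(query):
        hits |= index.get(term, set())
    return sorted(hits)


DOCUMENTS = ["quick brown", "fox over", "lazy dog"]
QUERY = "the quick brown fox jumps over the lazy dog"

index = build_index(DOCUMENTS)
result = [DOCUMENTS[i] for i in search(index, QUERY)]
print(result)  # ['quick brown', 'lazy dog']
```

Each query bigram costs one dictionary lookup here, which mirrors the point that the query reduces to a set of independent term lookups. One caveat of this sketch: a stored phrase longer than two words would match as soon as any one of its bigrams appears in the query.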
