How to match against subsets of a search string in Solr/Lucene


Question

I've got an unusual situation. Normally, when you search a text index, you search for a small number of keywords against documents that contain a larger number of terms.

For example, you might search for "quick brown" and expect to match "the quick brown fox jumps over the lazy dog".

In my case, I have lots of small phrases in my document store, and I want to match them against a larger query phrase.

For example, if I have a query:

  • "the quick brown fox jumps over the lazy dog"

and the documents

  • "quick brown"
  • "fox over"
  • "lazy dog"

I'd like to find the documents that contain a phrase that occurs in the query. In this case that means "quick brown" and "lazy dog" (but not "fox over": although both of its tokens appear in the query, it is not a phrase in the search string).

Is this sort of query possible with Solr/Lucene?

Recommended answer

It sounds like you want to use ShingleFilter in your analysis chain so that you index word bigrams. To do that, add ShingleFilterFactory at both index time and query time.

At index time, your documents are then indexed like this:

  • "quick brown" -> quick_brown
  • "fox over" -> fox_over
  • "lazy dog" -> lazy_dog
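
As a rough illustration of what the shingle step produces, here is a minimal Python sketch. It is not Lucene code: `word_bigrams` is a hypothetical helper, the tokenization is assumed to be simple lowercasing plus whitespace splitting, and the underscore separator only mirrors the notation used in the list above.

```python
def word_bigrams(text, sep="_"):
    """Return every pair of adjacent words in `text`, joined by `sep`.

    Assumption: lowercasing + whitespace splitting stands in for the
    real tokenizer configured in the analysis chain.
    """
    tokens = text.lower().split()
    # zip(tokens, tokens[1:]) pairs each word with its right-hand neighbour.
    return [sep.join(pair) for pair in zip(tokens, tokens[1:])]


# The three example documents from the question.
documents = ["quick brown", "fox over", "lazy dog"]
indexed = {doc: word_bigrams(doc) for doc in documents}

for doc, terms in indexed.items():
    print(doc, "->", terms)
```

Each two-word document yields exactly one bigram term, which is what ends up in the index in this simplified picture.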

At query time, your query becomes:

  • "the quick brown fox jumps over the lazy dog" -> "the_quick quick_brown brown_fox fox_jumps jumps_over over_the the_lazy lazy_dog"

This is still not enough, because by default the query parser will form a phrase query from these tokens. So, in your query analyzer only, add PositionFilterFactory after the ShingleFilterFactory. This "flattens" the positions in the query so that the query parser treats the output tokens as synonyms, which yields a BooleanQuery with the following sub-queries (all SHOULD clauses, so it is basically an OR query):

BooleanQuery:

  • the_quick OR
  • quick_brown OR
  • brown_fox OR
  • ...

This should be the most performant approach, because the result is really just a BooleanQuery of TermQuery clauses.
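
The overall idea (bigram terms in the index, an OR over the query's bigrams) can be simulated in a few lines of Python. This is only a sketch of the matching logic, not of Solr itself: `word_bigrams`, `build_index`, and `search` are hypothetical names, and a plain dictionary of sets stands in for the inverted index.

```python
from collections import defaultdict


def word_bigrams(text, sep="_"):
    """Adjacent word pairs of `text`, joined by `sep` (assumed tokenization)."""
    tokens = text.lower().split()
    return [sep.join(pair) for pair in zip(tokens, tokens[1:])]


def build_index(documents):
    """Map each bigram term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc in documents:
        for term in word_bigrams(doc):
            index[term].add(doc)
    return index


def search(index, query):
    """OR together the postings of every bigram in the query."""
    hits = set()
    for term in word_bigrams(query):
        # .get avoids creating empty entries for terms that are not indexed.
        hits |= index.get(term, set())
    return hits


index = build_index(["quick brown", "fox over", "lazy dog"])
matches = search(index, "the quick brown fox jumps over the lazy dog")
print(sorted(matches))  # ['lazy dog', 'quick brown']
```

"fox over" is not returned because the query never produces the term fox_over, even though both words occur in it. Note that in this sketch a longer document would match as soon as any one of its bigrams appears in the query, so it fits the two-word phrases of the example best.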
