How to match against subsets of a search string in SOLR/lucene


Problem description

I've got an unusual situation. Normally, when you search a text index, you search for a small number of keywords against documents with a larger number of terms.

For example, you might search for "quick brown" and expect to match "the quick brown fox jumps over the lazy dog".

My situation is the reverse: I have lots of small phrases in my document store, and I wish to match them against a larger query phrase.

For example, if I have a query:

  • "the quick brown fox jumps over the lazy dog"

and the documents:

  • "quick brown"
  • "fox over"
  • "lazy dog"

I'd like to find the documents that contain a phrase that occurs in the query. In this case that means "quick brown" and "lazy dog", but not "fox over": although its tokens both appear in the query, they are not adjacent there, so it is not a phrase in the search string.

Is this sort of query possible with SOLR/lucene?

Recommended answer

It sounds like you want to use ShingleFilter in your analysis, so that you index word bigrams: add ShingleFilterFactory at both query and index time.

At index time your documents are then indexed as such:

  • "quick brown" -> quick_brown
  • "fox over" -> fox_over
  • "lazy dog" -> lazy_dog
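
The effect of a bigram shingle filter can be imitated outside Solr in a few lines of Python. This is only a sketch: the `shingles` helper is hypothetical, it assumes whitespace tokenization with lowercasing, and it joins words with `_` to mirror the notation above (Lucene's ShingleFilter uses a space as its default separator and also emits the single words unless configured otherwise).

```python
def shingles(text, size=2, sep="_"):
    """Return the word n-grams (shingles) of `text`, joined by `sep`."""
    tokens = text.lower().split()
    # Slide a window of `size` words over the token list.
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# The three small documents from the question.
documents = ["quick brown", "fox over", "lazy dog"]
indexed = {doc: shingles(doc) for doc in documents}

for doc, terms in indexed.items():
    print(doc, "->", terms)
```

Each two-word document collapses to a single indexed term, which is what makes the later lookup a plain term match.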

At query time, the query becomes:

  • "the quick brown fox jumps over the lazy dog" -> "the_quick quick_brown brown_fox fox_jumps jumps_over over_the the_lazy lazy_dog"

This is still no good, because by default it will form a phrase query. So, in your query analyzer only, add PositionFilterFactory after the ShingleFilterFactory. This "flattens" the positions in the query so that the query parser treats the output as synonyms, which yields a BooleanQuery with these sub-queries (all SHOULD clauses, so it's basically an OR query):

BooleanQuery:

  • the_quick OR
  • quick_brown OR
  • brown_fox OR
  • ...
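
To see why an OR over the query's shingle terms gives the desired result, the lookup can be simulated with a plain inverted index. This is a rough sketch, not Lucene code: `shingles`, `build_index`, and `search` are hypothetical helpers, tokenization is a simple whitespace split, and scoring is ignored.

```python
from collections import defaultdict


def shingles(text, size=2, sep="_"):
    """Return the word n-grams (shingles) of `text`, joined by `sep`."""
    tokens = text.lower().split()
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def build_index(documents):
    """Map each shingle term to the set of documents containing it."""
    index = defaultdict(set)
    for doc in documents:
        for term in shingles(doc):
            index[term].add(doc)
    return index


def search(index, query):
    """OR together the postings of every shingle in the query."""
    hits = set()
    for term in shingles(query):
        hits |= index.get(term, set())
    return hits


index = build_index(["quick brown", "fox over", "lazy dog"])
result = search(index, "the quick brown fox jumps over the lazy dog")
print(sorted(result))  # ['lazy dog', 'quick brown']
```

"fox over" is not returned because the query never produces the term fox_over: "fox" and "over" are not adjacent in the query. Note that in this sketch a longer document would match as soon as it shares any one bigram with the query, since every clause is optional.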

This should be the most performant way, since it is then really just a BooleanQuery of TermQueries.
