Exact field search with Solr/Lucene


Question


I have a text field. For a given query, I want to find all documents whose indexed field value is contained in the query:

query.contains(document.field_name)

Examples:
1. field_name:"a b"
2. field_name:"a b c"

For the query "a b d", I want to find only the first item.

An inefficient way to do this is to generate all substrings of the query and index the field as a single string.

Is it possible to implement this requirement in Solr using existing functionality? If not, what is the most efficient algorithm or approach?

PS. It seems Google AdWords does this kind of matching to find ads.

Solution

Here's one way to do what you're asking for:

Field Type

<fieldType name="exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="0" generateWordParts="0" catenateAll="1" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="0" generateWordParts="1" catenateAll="0" />
    <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramsIfNoShingles="true" tokenSeparator="" maxShingleSize="99"/>
  </analyzer>
</fieldType>
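
To apply this type, a field in the schema has to reference it by name. A minimal sketch, assuming the field is called field_name as in the question; the indexed and stored settings are illustrative assumptions, not part of the original answer:

```xml
<!-- Hypothetical field declaration: "field_name" comes from the question; indexed/stored values are assumptions -->
<field name="field_name" type="exact" indexed="true" stored="true"/>
```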

Explanation:

The index analyzer uses WordDelimiterFilterFactory to split the field value into words. Using your example, "a b" is split into the words "a" and "b", and "a b d" is split into "a", "b", and "d". We set catenateAll="1" and generateWordParts="0" so that the individual words are discarded and only their concatenation is kept as a single token: "a" and "b" become "ab", and "a", "b", and "d" become "abd".

The query analyzer is similar, with minor differences. It also splits the value into words, but it neither discards nor concatenates them. Instead, the words are passed to ShingleFilterFactory, which takes "a" and "b" and returns "a", "b", and "ab".

The reason we use shingles instead of concatenation is to allow "a b c" to match "a b" and "b c". If you want "a b c" to match only "a b c", set catenateAll="1" and remove the shingle factory.
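
That stricter variant might look like the sketch below. The answer only says to set catenateAll="1" and remove the shingle filter; setting generateWordParts="0" as well is an assumption added here, so that the query analyzer mirrors the index analyzer and does not also emit the individual words.

```xml
<!-- Sketch of a stricter query analyzer: the whole query is concatenated into one token -->
<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <!-- generateWordParts="0" is an assumption (not stated in the answer); it keeps single words such as a or b from being emitted -->
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="0" generateWordParts="0" catenateAll="1" />
</analyzer>
```

With this variant, the query a b c would be reduced to the single token abc, so it should no longer match documents indexed as ab or bc.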

With this configuration, the query "a b" matches only documents whose value is "a", "b", or "a b" (not "a b d"). Likewise, the query "a b c" matches "a", "b", "c", "a b", "b c", and "a b c". Note also that the query "ab" matches "a b". If any of this is not what you want, you should be able to configure the shingle and word delimiter filter factories to do exactly what you need.

EDIT: Previous versions of this answer added magic values to mark the start and end of the value. It turns out that is unnecessary; concatenating the words is enough to prevent "a b" from matching "a b d".

EDIT 2 (index analyzer fix): WhitespaceTokenizerFactory should have been KeywordTokenizerFactory. Also, the WordDelimiterFilterFactory should have catenateAll="0".
