Exact field search with solr/lucene
Problem description
I have a text field, and for a given query I want to find all documents whose indexed field value is contained in the query:

query.contains(document.field_name)

Examples:
1. field_name:"a b"
2. field_name:"a b c"

For the query "a b d" I want to find only the first item.

An inefficient way to do this is to generate all substrings of the query and index the field as a string.

Is it possible to implement this requirement in Solr using existing functionality?
If not, what is the most efficient algorithm or way to do this?

PS. It seems that Google AdWords does this kind of matching to find ads.

Here's one way to do what you're asking for:

Field Type

<fieldType name="exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="0" generateWordParts="0" catenateAll="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="0" generateWordParts="1" catenateAll="0"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramsIfNoShingles="true" tokenSeparator="" maxShingleSize="99"/>
  </analyzer>
</fieldType>

Explanation:

The index analyzer uses WordDelimiterFilterFactory to split the field value into words. So using your example, "a b" is split into the words "a" and "b", and "a b d" is split into "a", "b", and "d". We set catenateAll="1" and generateWordParts="0" so the individual words are discarded, resulting in a single word: "a" and "b" become "ab", and "a", "b", and "d" become "abd".

The analyzer for queries is similar, with minor differences. We split the value into words, except that we do not discard the words or concatenate them. Instead, we pass the words to the ShingleFilterFactory, which takes "a" and "b" and returns "a", "b", and "ab".

The reason we use shingles instead of concatenation is to allow the query "a b c" to match the field values "a b" and "b c". If you want "a b c" to match only "a b c", set catenateAll="1" and remove the shingle factory.

Using this configuration, the query "a b" will match only the field values "a", "b", and "a b" (not "a b d"). Also, the query "a b c" will match "a", "b", "c", "a b", "b c", and "a b c". It should also be noted that "ab" will match "a b". If any of this is not what you want, you should be able to configure the shingle and word filter factories to do exactly what you need.

EDIT: Previous versions of this answer put magic values to mark the start and end of the value. It turns out that is unnecessary; just concatenating the values together is enough to prevent "a b" from matching "a b d".

EDIT 2 (index analyzer fix): WhitespaceTokenizerFactory should have been KeywordTokenizerFactory. Also, the WordDelimiterFilterFactory should have catenateAll="0".
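
To make the matching behaviour described in the explanation easier to check by hand, here is a minimal Python sketch that imitates the two analyzers outside of Solr. It is only an approximation under stated assumptions: the helper names (split_words, index_token, query_tokens, matches) are hypothetical, the word splitting is a simple regular expression rather than the real WordDelimiterFilterFactory logic, and no Solr API is involved.

```python
import re

MAX_SHINGLE_SIZE = 99  # mirrors maxShingleSize="99" in the query analyzer


def split_words(value):
    """Rough stand-in for the word delimiter filter: split on non-word characters."""
    return [w for w in re.split(r"[\W_]+", value) if w]


def index_token(value):
    """Index side: the words are discarded and only their concatenation is kept."""
    return "".join(split_words(value))


def query_tokens(value):
    """Query side: every word plus every contiguous shingle, joined without a separator."""
    words = split_words(value)
    tokens = set()
    for start in range(len(words)):
        longest = min(MAX_SHINGLE_SIZE, len(words) - start)
        for size in range(1, longest + 1):
            tokens.add("".join(words[start:start + size]))
    return tokens


def matches(query, field_value):
    """A document matches when its single index token is one of the query tokens."""
    return index_token(field_value) in query_tokens(query)


print(matches("a b d", "a b"))    # True
print(matches("a b d", "a b c"))  # False
```

With the examples from the question, the query "a b d" produces the tokens a, b, d, ab, bd, and abd, so it finds the document indexed as ab but not the one indexed as abc.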