Return only results that match enough NGrams with Solr
Question
To achieve some degree of fault tolerance with Solr, I have started to use the NGramFilterFactory. Here are the interesting bits from the schema.xml:
<field name="text" type="text" indexed="true" stored="true"/>
<copyField source="text" dest="text_ngram" />
<field name="text_ngram" type="text_ngram" indexed="true" stored="false"/>
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3" />
  </analyzer>
</fieldType>
I am using the EDisMax query handler with pretty much the stock configuration. Here are the interesting lines from the solrconfig.xml:
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Query settings -->
    <str name="defType">edismax</str>
    <str name="qf">
      text text_ngram^0.001
    </str>
    <str name="mm">100%</str>
    <str name="q.op">AND</str>
    ...
This works fine, but it gives me lots of irrelevant results. Using Solr's analysis capabilities, I think I've tracked the issue down to the following cause:
The query is broken down into NGrams. Then Solr searches for either the tokenized query in the text field or any one of the NGrams in the text_ngram field. Using debug=query prints the following parsedquery when searching for "something":
(+DisjunctionMaxQuery(((text_ngram:som text_ngram:ome text_ngram:met text_ngram:eth text_ngram:thi text_ngram:hin text_ngram:ing) | text:something)))/no_coord
If I read this right, it means that either
- one of the NGrams needs to match, or
- the original (tokenized) query needs to match.
As a result, this will also find items like "ethernet", because one of the NGrams (eth) is the same.
My question is: How can I set a higher threshold for the NGram matches? Is there a way to say "only return the item if at least 90% of the NGrams from the query match"? Making sure that 100% of the NGrams match would not make sense as this would effectively kill the fault tolerance.
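To make the threshold idea concrete, here is a minimal Python sketch of the matching rule I have in mind. It runs outside Solr and is not how Solr scores or matches documents; the function names are made up for illustration, and the n-gram size of 3 simply mirrors the minGramSize/maxGramSize above.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of text (n=3 mirrors the field type above)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def match_fraction(query, candidate, n=3):
    """Fraction of the query's n-grams that also occur in the candidate."""
    query_grams = ngrams(query, n)
    if not query_grams:
        return 0.0  # query shorter than n characters: nothing to compare
    return len(query_grams & ngrams(candidate, n)) / len(query_grams)


def passes(query, candidate, threshold=0.9, n=3):
    """True if at least `threshold` of the query's n-grams match the candidate."""
    return match_fraction(query, candidate, n) >= threshold


# "something" vs "ethernet": only "eth" is shared, so 1 of 7 query n-grams match.
print(match_fraction("something", "ethernet"))
# "someting" (typo) vs "something": 4 of 6 query n-grams match.
print(match_fraction("someting", "something"))
```

Under these assumptions a single missing letter already drops the fraction to 4/6, so with trigrams a 90% cutoff would probably be too strict for short words; the threshold would need tuning.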
Another way I thought of was to return only results that are above a certain score threshold relative to the top result, since the item "something" will have a much higher relevancy score than "ethernet". Is there a way to hook into Solr so that it returns only results that have, e.g., at least 1/100th of the score of the top result? I read that there is a way to provide a custom HitCollector, but I couldn't really find any info on this.
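Short of a custom collector, the relative-score cutoff could be approximated on the client side. The sketch below assumes the documents were requested with the score included (for example fl=*,score), so that each returned document carries a "score" entry; the function name and the sample data are hypothetical. It only filters the documents that were actually fetched, so it does not fix the total hit count or paging.

```python
def filter_by_relative_score(docs, min_ratio=0.01):
    """Keep only docs whose score is at least min_ratio times the top score.

    `docs` is assumed to be a list of dicts with a numeric "score" key.
    """
    if not docs:
        return []
    top_score = max(doc["score"] for doc in docs)
    if top_score <= 0:
        return list(docs)  # no meaningful ranking to compare against
    cutoff = top_score * min_ratio
    return [doc for doc in docs if doc["score"] >= cutoff]


# Hypothetical scores, for illustration only.
sample = [
    {"id": "something", "score": 12.0},
    {"id": "ethernet", "score": 0.05},
]
print([doc["id"] for doc in filter_by_relative_score(sample)])
```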
Thanks!
Recommended answer
The idea was to achieve some kind of fault-tolerant search: when someone searches for "someting", it should find "something".
Solr's SpellChecker does fuzzy matching, and you can set thresholds on it: http://wiki.apache.org/solr/SpellCheckComponent
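As a rough illustration of what such a request might look like, the Python sketch below builds a query URL with the spellcheck parameters described on that wiki page (spellcheck, spellcheck.count, spellcheck.accuracy, spellcheck.collate). The base URL, the handler path and the accuracy value are assumptions, and the handler must already have the SpellCheckComponent configured for the parameters to have any effect.

```python
from urllib.parse import urlencode


def build_spellcheck_url(base_url, query, accuracy=0.7, count=5):
    """Build a Solr request URL that asks for spelling suggestions.

    base_url is a hypothetical handler URL; accuracy and count are example values.
    """
    params = {
        "q": query,
        "spellcheck": "true",             # turn the spellcheck component on
        "spellcheck.count": count,        # max suggestions per term
        "spellcheck.accuracy": accuracy,  # minimum similarity for a suggestion
        "spellcheck.collate": "true",     # also return a rewritten query
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical host, core and handler names.
print(build_spellcheck_url("http://localhost:8983/solr/mycore/browse", "someting"))
```

The suggested term (or the collated query) would then be used for a second, exact search instead of relying on single NGram hits.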