Return only results that match enough NGrams with Solr

Question

To achieve some degree of fault tolerance with Solr, I have started using the NGramFilterFactory. Here are the interesting bits from schema.xml:

<field name="text" type="text" indexed="true" stored="true"/>
<copyField source="text" dest="text_ngram" />
<field name="text_ngram" type="text_ngram" indexed="true" stored="false"/>

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3" />
    </analyzer>
</fieldType>

I am using the EDisMax query parser with pretty much the stock configuration. Here are the relevant lines from solrconfig.xml:

  <requestHandler name="/browse" class="solr.SearchHandler">
    <lst name="defaults">
      <!-- Query settings -->
      <str name="defType">edismax</str>
      <str name="qf">
        name name_ngram^0.001
      </str>
      <str name="mm">100%</str>
      <str name="q.op">AND</str>
      ...

This works, but it returns lots of irrelevant results. Using Solr's analysis tool, I think I have tracked the issue down to the following cause:

The query is broken down into NGrams. Solr then searches for either the tokenized query in the text field or any one of the NGrams in the text_ngram field. With debug=query, searching for "something" prints the following parsedquery:

(+DisjunctionMaxQuery(((text_ngram:som text_ngram:ome text_ngram:met text_ngram:eth text_ngram:thi text_ngram:hin text_ngram:ing) | text:something)))/no_coord

If I read this right, it means that either

  1. one of the NGrams has to match, or
  2. the original (tokenized) query has to match.

As a result, this also finds items like "ethernet", because one of the NGrams (eth) is the same.
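To make the overlap concrete, here is a minimal Python sketch that mimics what a KeywordTokenizer followed by an NGram filter with minGramSize=3 and maxGramSize=3 would emit for a single word. The helper name `char_ngrams` is made up for illustration and is not part of Solr; it only approximates the analysis chain shown above.

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return all contiguous character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Trigrams of the query term, matching the parsedquery shown above.
query_grams = char_ngrams("something")

# Trigrams that an indexed value of "ethernet" would produce.
doc_grams = char_ngrams("ethernet")

# The only trigram the two words have in common.
shared = sorted(set(query_grams) & set(doc_grams))

print(query_grams)  # ['som', 'ome', 'met', 'eth', 'thi', 'hin', 'ing']
print(shared)       # ['eth']
```

A single shared trigram is enough to satisfy the NGram side of the disjunction, which is why "ethernet" shows up in the results.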

My question is: how can I set a higher threshold for the NGram matches? Is there a way to say "only return the item if at least 90% of the NGrams from the query match"? Requiring 100% of the NGrams to match would not make sense, as that would effectively kill the fault tolerance.
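The threshold idea can be sketched outside Solr to see how such a percentage would behave. The following Python snippet is only an illustration of the arithmetic, not a Solr feature: `char_ngrams` and `matched_fraction` are hypothetical helpers, and the comparison is between two plain strings rather than an index.

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return all contiguous character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def matched_fraction(query: str, indexed: str, n: int = 3) -> float:
    """Fraction of the query's distinct n-grams that also occur in `indexed`."""
    query_grams = set(char_ngrams(query, n))
    if not query_grams:
        return 0.0  # query shorter than n: nothing to match
    indexed_grams = set(char_ngrams(indexed, n))
    return len(query_grams & indexed_grams) / len(query_grams)


# Unrelated word sharing one trigram: 1 of 7 query trigrams match.
print(matched_fraction("something", "ethernet"))

# Misspelled query against the intended word: 4 of 6 query trigrams match.
print(matched_fraction("someting", "something"))
```

Under these assumptions, a single dropped letter already lowers the matched fraction to about two thirds, so a 90% cutoff would reject the typo it is supposed to tolerate; for short words the threshold would need to be noticeably lower.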

Another approach I thought of is to return only results whose score is above a certain threshold relative to the top result. This would work because the item "something" will have a much higher relevancy than "ethernet". So is there a way to hook into Solr to return only results that have, e.g., at least 1/100th of the score of the top result? I read that there is a way to provide a custom HitCollector, but I couldn't really find any information on this.
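For comparison, the relative score cutoff can be applied on the client after the results come back, without touching Solr internals. The sketch below is an assumption-laden illustration, not a custom HitCollector: it expects each document to be a dict carrying a `score` field (which Solr returns when `score` is included in `fl`), the function name `filter_by_relative_score` is hypothetical, and the sample scores are invented.

```python
def filter_by_relative_score(docs: list[dict], min_ratio: float = 0.01) -> list[dict]:
    """Keep only docs whose score is at least `min_ratio` times the top score."""
    if not docs:
        return []
    top_score = max(doc["score"] for doc in docs)
    cutoff = top_score * min_ratio
    return [doc for doc in docs if doc["score"] >= cutoff]


# Invented scores, only to show the effect of a 1/100 cutoff.
results = [
    {"id": "something", "score": 12.0},
    {"id": "ethernet", "score": 0.05},
]

kept = filter_by_relative_score(results, min_ratio=0.01)
print([doc["id"] for doc in kept])  # ['something']
```

Because the filtering happens after retrieval, the total hit count and pagination reported by Solr would no longer line up with what is shown, so this only suits cases where the first page of results is all that matters.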

Thanks!

Recommended answer

The idea was to achieve some kind of fault-tolerant search: when someone searches for "someting", it should find "something".

Solr's SpellChecker does fuzzy search, and you can set thresholds on it: http://wiki.apache.org/solr/SpellCheckComponent
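As a rough sketch of what such a request might look like, the Python snippet below builds a query string using the standard SpellCheckComponent request parameters (`spellcheck`, `spellcheck.accuracy`, `spellcheck.count`, `spellcheck.collate`). It assumes the spellcheck component is already configured on the request handler; the `/select` path, the function name `spellcheck_query_string`, and the chosen values are illustrative assumptions, not taken from the question.

```python
from urllib.parse import urlencode


def spellcheck_query_string(user_query: str, accuracy: float = 0.7) -> str:
    """Build a URL query string that asks Solr for spelling suggestions."""
    params = {
        "q": user_query,
        "spellcheck": "true",             # enable the spellcheck component
        "spellcheck.accuracy": accuracy,  # minimum accuracy for a suggestion
        "spellcheck.count": 5,            # maximum suggestions per term
        "spellcheck.collate": "true",     # ask for a rewritten, corrected query
    }
    return urlencode(params)


# Hypothetical handler path; adjust to the handler that has spellcheck enabled.
print("/select?" + spellcheck_query_string("someting"))
```

Raising `spellcheck.accuracy` plays the role of the threshold asked about in the question: fewer, closer suggestions are returned, and the collated query can then be run as a normal exact search.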
