Solr Fuzzy Search for similar words


Question

I am trying to run a fuzzy search for "jahngir"~0.2, but it returns no results. My index has records containing "JAHANGIR RAHMAN MD". If I search with the correctly spelled word, "jahangir"~0.2, it works. Can someone explain what I am doing wrong? I have spent a lot of time trying to figure out how Solr fuzzy search works, so any links that explain it would be helpful. Below is the text field type I am using for indexing. Thanks in advance.

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>

Here is the configuration that worked for me after the response. Thanks!

<!-- Modified to fit fuzzy queries -->
<fieldType name="text_exact_fuzzy" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Recommended answer

No, you do not need to enable stemming, and the use of a stemmer may be causing the problem.

You have far too many filters on the text field. You are converting a word to a Porter stem, which often is not a real word, then taking the phonetic key of that. The surface word will rarely match the phonetic key stored in the index. The phonetic key will be very different from the original word.

Use the analyzer page in the admin UI to see how terms are processed.
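
The same analysis output can also be fetched over HTTP, which helps when comparing several inputs. This is a minimal sketch, assuming a Solr version in which the field analysis handler must be registered explicitly in solrconfig.xml (some versions provide it without configuration). The core name `mycore` in the comment is hypothetical.

```xml
<!-- solrconfig.xml: exposes the analysis chain at /analysis/field.
     Example request (hypothetical core name "mycore"):
     /solr/mycore/analysis/field?analysis.fieldtype=text&analysis.fieldvalue=JAHANGIR&analysis.query=jahngir -->
<requestHandler name="/analysis/field"
                class="solr.FieldAnalysisRequestHandler"
                startup="lazy"/>
```

The response lists the tokens produced after each tokenizer and filter stage, so you can see exactly where "JAHANGIR" stops resembling the term typed in the query.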

I recommend splitting the kinds of approximate match into different fields.

  • text_exact: lowercase only, nothing else
  • text_stem: lowercase and stemming
  • text_phonetic: lowercase and Double Metaphone, no stemming
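
The three field types listed above might be declared roughly as follows. This is a sketch, not a tested schema: the field names `name`, `name_exact`, `name_stem`, and `name_phonetic` are hypothetical, and the tokenizer choice is an assumption. The filter classes are the ones already used in the question's configuration.

```xml
<!-- Sketch only: type names follow the list above; field names are hypothetical. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>

<!-- One source field copied into three differently analyzed fields. -->
<field name="name" type="string" indexed="false" stored="true"/>
<field name="name_exact" type="text_exact" indexed="true" stored="false"/>
<field name="name_stem" type="text_stem" indexed="true" stored="false"/>
<field name="name_phonetic" type="text_phonetic" indexed="true" stored="false"/>

<copyField source="name" dest="name_exact"/>
<copyField source="name" dest="name_stem"/>
<copyField source="name" dest="name_phonetic"/>
```

Because each field applies only one kind of approximation, the indexed terms in `name_exact` stay close to what users actually type, which is what fuzzy matching needs.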

Use fuzzy matching with text_exact, because it handles typing errors. Do not use fuzzy against the other fields.

You can weight these fields differently. The exact match is a higher-quality match than the rest, so it can have the biggest weight. The stemmed match is a better match than the phonetic one, so its weight should be smaller than exact but bigger than phonetic.
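
One way to express such weights is through the `qf` parameter of the eDisMax query parser. The sketch below assumes three hypothetical fields named `name_exact`, `name_stem`, and `name_phonetic`. The boost values are illustrative only and would need tuning against real queries.

```xml
<!-- solrconfig.xml sketch: field names and boost values are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- exact > stemmed > phonetic, as recommended above -->
    <str name="qf">name_exact^10 name_stem^5 name_phonetic^2</str>
  </lst>
</requestHandler>
```

To keep fuzzy matching confined to the exact field, a fuzzy clause can name that field explicitly, for example `name_exact:jahngir~`, rather than relying on the weighted field list.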
