Solr停用词替换为_符号 [英] Solr stop words replaced with _ symbol

查看:138
本文介绍了Solr停用词替换为_符号的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在自动建议中遇到solr停用词的问题.所有停用词均由_符号代替.

I have problems with solr stopwords in my autosuggest. All stopwords was replaced by _ symbol.

例如,我在字段"deal_title"中有文本"the simple text in".当我尝试搜索单词"simple"时,solr向我显示下一个结果"_ simple text _",但我希望使用"simple text".

For example I have text "the simple text in" in field "deal_title". When I try to search word "simple" solr show me next result "_ simple text _" but I expect "simple text".

有人可以解释一下为什么这样做会如此以及如何解决吗? 这是我的schema.xml的一部分

Could someone explain me why this works in such way and how to fix it ? Here is part of my schema.xml

<fieldType class="solr.TextField" name="text_auto">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramsIfNoShingles="false" /> 
    </analyzer> 
    <analyzer type="query">
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
        <tokenizer class="solr.StandardTokenizerFactory"/> 
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    </analyzer>
</fieldType>

<field name="deal_title" type="text_auto" indexed="true" stored="true" required="false" multiValued="false"/>

<fieldType name="text_general" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

推荐答案

我在Solr 6.3(不再可能不再使用enablePositionIncrements="false")中解决此问题的方法是:

My solution to this in Solr 6.3 (where enablePositionIncrements="false" isn't possible anymore) was to:

  1. 删除停用词
  2. fillerToken=""的带状疱疹(除去了_)
  3. 删除前导和尾随空格
  4. 删除重复项

  1. remove stopwords
  2. shingle with fillerToken="" (which removes the _)
  3. remove leading and trailing spaced
  4. remove duplicates

<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.ShingleFilterFactory" fillerToken=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(^ | $)" replacement=""/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

这篇关于Solr停用词替换为_符号的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆