Can't remove punctuation in Solr

Question

I have a Solr install that queries content on a Drupal site. Many of the title fields have punctuation at the start of the string, so when I sort by title the entries with leading punctuation appear at the top of the list.

I would like Solr to ignore the leading punctuation when sorting by title, but none of the solutions I have tried work.

I am fairly new to Solr, so I may be getting something really simple wrong. I don't understand much of what is going on in the schema.xml file.

The title field is called label in Solr. I have tried various patterns with solr.PatternReplaceFilterFactory, and none of them work.

<field name="label" type="text" indexed="true" stored="true" termVectors="true" omitNorms="true"/>
<copyField source="label" dest="sort_label"/>

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="(^\p{Punct}+)" replacement="" replace="all"
            />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>

    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="0"
            preserveOriginal="1"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

  </analyzer>
  <analyzer type="query">
    …
  </analyzer>

My query is:

start=0&rows=25&q=education&fl=id%2Centity_id%2Centity_type%2Cbundle%2Cbundle_name%2Csort_label%2Css_language%2Cis_comment_count%2Cds_created%2Cds_changed%2Cscore%2Cpath%2Curl%2Cis_uid%2Ctos_name%2Czm_parent_entity%2Css_filemime%2Css_file_entity_title%2Css_file_entity_url&pf=content%5E2.0&&sort=sort_label%20asc

Recommended answer

This is done with the WordDelimiterFilterFactory. Set generateWordParts="1" and add the filter to the analyzer of your field type, as in the definition below.

After modifying schema.xml, restart the server and re-index the data.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>

    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="0"
            preserveOriginal="1"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

  </analyzer>
</fieldType>
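
Note that the query in the question sorts on sort_label rather than on label, so the analyzer of whatever type sort_label uses is what determines the sort order. Solr also generally expects a sort field to yield a single term per document, which a whitespace-tokenized text field does not. As a rough sketch only, a dedicated sort type along these lines might be worth trying. The type name sort_text is hypothetical, and the sort_label definition is an assumption, because the question does not show how that field is declared in the original schema.

```xml
<!-- Hypothetical sort type; the name "sort_text" is illustrative, not from the original schema. -->
<fieldType name="sort_text" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer keeps the whole title as one token, so each document has a single sort term. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercase so that sorting is case-insensitive. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Remove punctuation only at the very start of the title. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^\p{Punct}+" replacement="" replace="all"/>
    <!-- Trim whitespace left behind after the punctuation is stripped. -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed declaration: this would replace any existing sort_label field definition. -->
<field name="sort_label" type="sort_text" indexed="true" stored="true"/>
<copyField source="label" dest="sort_label"/>
```

Filters must come after the tokenizer inside an analyzer, and, as with the change above, the server needs a restart and the data a re-index before the new sort order takes effect. The stored value of sort_label is still the original title, since analysis only changes the indexed term.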
