solr not tokenizing protected words


Problem Description

I have documents in Solr/Lucene (3.x) with a special copy field facet_headline in order to have an unstemmed field for faceting.
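
For reference, such a copy field setup in schema.xml typically looks roughly like this (a sketch only; the source field name headline is an assumption, as only facet_headline appears in the question):

    <field name="facet_headline" type="facet_headline" indexed="true" stored="true"/>
    <copyField source="headline" dest="facet_headline"/>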

Sometimes two or more words belong together and should be handled/counted as one word, for example "kim jong il".

So the headline "Saturday: kim jong il had died" should be split into:

    Saturday
    kim jong il
    had
    died

For this reason I decided to use protected words (protwords.txt), where I added kim jong il. The schema.xml looks like this:

    <fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
        <analyzer>
            <tokenizer class="solr.PatternTokenizerFactory"
                       pattern="\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
            <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0"
                    protected="protwords.txt" />
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.TrimFilterFactory"/>
            <filter class="solr.StopFilterFactory"
                    ignoreCase="true"
                    words="stopwords.txt"
                    enablePositionIncrements="true" />
        </analyzer>
    </fieldType>
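
The protwords.txt referenced above would then contain the phrase to protect, one entry per line (a minimal sketch, assuming the file lives next to schema.xml in the core's conf directory):

    kim jong il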

Using the Solr analysis page, this doesn't seem to work: the string is still split into 6 words. It looks as if protwords.txt is not used. However, if the headline contains ONLY the name kim jong il, everything works fine and the terms aren't split.

Is there a way to reach my goal: not to split specific words/word groups?

Recommended Answer

After searching the web, I came to the conclusion that it is not possible to reach this goal with this analysis chain. The protected-words list of WordDelimiterFilterFactory is checked against each whole token coming out of the tokenizer, so it can only prevent a single token from being split further; it never joins several tokens back into one. Keeping multi-word groups together is simply not what these tokenizers and filters are designed for.
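
To make this concrete, here is a minimal sketch of how the chain above treats the headline from the question. It is a rough simulation in Python, not how Solr is actually implemented, and the tokenizer pattern is simplified:

    import re

    # Assumed contents of protwords.txt from the question.
    PROTWORDS = {"kim jong il"}

    # PatternTokenizerFactory: split only on the listed punctuation characters
    # (simplified to a character class; whitespace is NOT a split point).
    def pattern_tokenize(text):
        return [tok for tok in re.split(r'[?!.:;,"()\\+*<>]', text) if tok.strip()]

    # WordDelimiterFilterFactory: a token that matches the protected list is
    # passed through unchanged; every other token is split on non-alphanumeric
    # characters, which includes the spaces inside it.
    def word_delimiter_filter(tokens):
        out = []
        for tok in tokens:
            if tok in PROTWORDS:
                out.append(tok)
            else:
                out.extend(t for t in re.split(r"[^0-9A-Za-z]+", tok) if t)
        return out

    print(word_delimiter_filter(pattern_tokenize("Saturday: kim jong il had died")))
    # ['Saturday', 'kim', 'jong', 'il', 'had', 'died'] -- the protected phrase is
    # never seen as a whole token, so it gets split like everything else.

    print(word_delimiter_filter(pattern_tokenize("kim jong il")))
    # ['kim jong il'] -- the whole token matches protwords.txt and survives.

This mirrors the behaviour observed in the Solr analysis page: in a longer headline the tokenizer never isolates kim jong il as a single token, so the protected-words list never gets a chance to match.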
