Solr not tokenizing protected words
Question
I have documents in Solr/Lucene (3.x) with a special copy field, facet_headline, so that I have an unstemmed field for faceting.
Sometimes two or more words belong together and should be handled and counted as one word, for example "kim jong il".
So the headline "Saturday: kim jong il had died" should be split into:
Saturday
kim jong il
had
died
For this reason I decided to use protected words (protwords.txt), to which I added kim jong il. The schema.xml looks like this:
<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\?|\!|\.|\:|\;|\,|&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0"
            protected="protwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
  </analyzer>
</fieldType>
According to the Solr analysis tool, this does not work. The string is still split into 6 words, as if protwords.txt were not used at all. However, if the headline contains ONLY the name kim jong il, everything works fine and the terms are not split.
Is there a way to reach my goal of not splitting specific words or word groups?
Recommended answer
After searching the web, I came to the conclusion that it is not possible to reach this goal with this setup. It looks like keeping multi-word phrases together is not what these tokenizers and filters are designed for.
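A likely explanation for the behavior described in the question, as far as I understand the protected attribute of WordDelimiterFilterFactory: it exempts a token from splitting only when the whole token matches an entry in protwords.txt. The pattern tokenizer in the schema splits on punctuation but not on whitespace, so the headline reaches the filter as the tokens "Saturday" and " kim jong il had died", and neither equals "kim jong il". The Python sketch below is a simplified model of that pipeline, not Solr code; the names PROTECTED, pattern_tokenize, word_delimiter and analyze are made up for illustration, and the stop filter is left out.

```python
import re

# Hypothetical stand-in for the contents of protwords.txt.
PROTECTED = {"kim jong il"}

# Simplified version of the punctuation pattern from the schema
# (the numeric alternative "([0-31]+\.)" is omitted here).
PUNCTUATION = re.compile(r'[?!.:;,"()\\+*<>]')


def pattern_tokenize(text):
    """Split on punctuation only, so whitespace stays inside the tokens."""
    return [t for t in PUNCTUATION.split(text) if t.strip()]


def word_delimiter(token, protected=PROTECTED):
    """Assumed behavior: leave a token alone only if the WHOLE token is protected."""
    if token in protected:
        return [token]
    # Otherwise split on every non-alphanumeric character, including spaces.
    return re.findall(r"[A-Za-z0-9]+", token)


def analyze(text):
    """Tokenize, apply the word delimiter step, then lowercase and trim."""
    out = []
    for token in pattern_tokenize(text):
        out.extend(word_delimiter(token))
    return [t.lower().strip() for t in out]


if __name__ == "__main__":
    print(analyze("Saturday: kim jong il had died"))
    print(analyze("kim jong il"))
```

Under these assumptions the full headline comes out as six separate words, while a headline consisting only of the name survives as a single token, which matches both observations in the question.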