Lucene Analyzer chain: ShingleFilter without filler tokens


Question

In my analyzer chain, ShingleFilter comes after the stopword filter. As mentioned in the docs, ShingleFilter handles position increments > 1 by inserting filler tokens (tokens with term text "_").

For example: "please divide this sentence into biword shingles"

Shingles of size 2: please divide, divide _, _ sentence, sentence _, _ biword, biword shingles (assuming that "this" and "into" are stopwords)

I would like to eliminate the shingles that contain filler tokens, i.e. my desired output contains only: please divide, biword shingles.

I have a dedicated field for facets with shingles up to 4-grams. Because of these stopwords, the facet constraints (or values) look useless with fillers in them, like "divide _ sentence _".

Please guide me.

Using Solr 4.4.

Update

I thought of setting enablePositionIncrements to false in the StopFilter configuration. I am not sure whether that would solve the problem, and in any case Lucene 4.4 doesn't support it anymore.

Recommended answer

Add PatternReplaceFilterFactory to your analyzer chain after ShingleFilterFactory. Replace every token that contains the filler token with an empty string, i.e. "".

This may solve your problem temporarily, but for a permanent solution you will have to write your own analyzer or customize ShingleFilter.

Sample field type:

<fieldType name="text_general_shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
    </analyzer>
</fieldType>
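
One caveat with the field type above: PatternReplaceFilterFactory rewrites the term text but does not drop the token, so shingles that contained a filler may remain in the stream as empty-string tokens and could still show up as an empty facet value. A possible refinement, offered as a sketch rather than a tested configuration, is to append a LengthFilterFactory with min="1" so that zero-length tokens are discarded. The field type name text_general_shingle_nofiller and the max="255" upper bound are assumptions chosen for illustration, and maxShingleSize is set to 4 only to match the 4-gram facets described in the question.

```xml
<!-- Hypothetical field type name; adjust to your schema. -->
<fieldType name="text_general_shingle_nofiller" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Shingles spanning a removed stopword contain the filler token "_". -->
        <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true"/>
        <!-- Blank out every token that contains the filler character. -->
        <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
        <!-- Drop the resulting empty tokens; max="255" is an arbitrary assumed upper bound. -->
        <filter class="solr.LengthFilterFactory" min="1" max="255"/>
    </analyzer>
</fieldType>
```

Note that the pattern matches any token containing an underscore, so a genuine term such as foo_bar would be removed as well. If such terms matter in your data, a stricter pattern would be needed.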
