Using Shingles and Stop Words with Elasticsearch and Lucene 4.4
Question
In the index I'm building, I'm interested in running a query, then (using facets) returning the shingles of that query. Here's the analyzer I'm using on the text:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "3"
        }
      }
    }
  }
}
The major issue is that, with Lucene 4.4, stop filters no longer support the enable_position_increments parameter, which could be used to eliminate shingles that contain stop words. Instead, for the text "red and yellow", I get results like this:
"terms": [
  {
    "term": "red",
    "count": 43
  },
  {
    "term": "red _",
    "count": 43
  },
  {
    "term": "red _ yellow",
    "count": 43
  },
  {
    "term": "_ yellow",
    "count": 42
  },
  {
    "term": "yellow",
    "count": 42
  }
]
Naturally, this greatly skews the number of shingles returned. Is there a way, post-Lucene 4.4, to manage this without post-processing the results?
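To make it clearer where the "_" terms come from, here is a rough Python model of the stop filter followed by the shingle filter. It is only a sketch, not Lucene's actual implementation: the stop word set is a tiny hypothetical subset, and the helper names (stop_with_gaps, shingles) are made up for illustration. The idea is that a removed stop word still leaves a position gap, and the shingle filter fills that gap with "_".

```python
# Simplified model: NOT Lucene's real code, just enough to reproduce the facet terms.
STOPWORDS = {"and", "the", "of"}  # hypothetical subset of the _english_ list


def stop_with_gaps(tokens, stopwords):
    """Drop stop words but keep their positions, marking each gap with None."""
    return [None if tok in stopwords else tok for tok in tokens]


def shingles(slots, min_size=2, max_size=3, filler="_"):
    """Emit unigrams plus shingles, writing `filler` wherever a gap occurs."""
    out = []
    for start, tok in enumerate(slots):
        if tok is not None:
            out.append(tok)  # the unigram itself
        for size in range(min_size, max_size + 1):
            window = slots[start:start + size]
            if len(window) < size:
                break  # ran off the end of the token stream
            if all(w is None for w in window):
                continue  # assumption: skip shingles made only of fillers
            out.append(" ".join(filler if w is None else w for w in window))
    return out


terms = shingles(stop_with_gaps("red and yellow".split(), STOPWORDS))
print(terms)
```

Under these assumptions, three of the five terms produced for "red and yellow" contain the filler, which is the skew described above.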
Recommended answer
Probably not the optimal solution, but the bluntest one is to add another filter to your analyzer that kills the "_" filler tokens. In the example below I called it "kill_fillers":
"shingleAnalyzer": {
  "tokenizer": "standard",
  "filter": [
    "standard",
    "lowercase",
    "custom_stop",
    "custom_shingle",
    "custom_stemmer",
    "kill_fillers"
  ],
  ...
Then define the "kill_fillers" filter alongside your other filter definitions:
"filter": {
  ...
  "kill_fillers": {
    "type": "pattern_replace",
    "pattern": ".*_.*",
    "replacement": ""
  },
  ...
}
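As a quick sanity check of what the pattern does, here is a small Python sketch that applies the same regular expression to the terms from the question. It only imitates the regex step; the function name kill_fillers_regex is made up for illustration, and Python's re module stands in for the Java regex engine that Elasticsearch uses (the two should agree on a pattern this simple).

```python
import re

# Same pattern as in the filter definition: any token containing "_" matches as a whole.
FILLER_PATTERN = re.compile(r".*_.*")


def kill_fillers_regex(term):
    """Replace a token that contains the filler with an empty string."""
    return FILLER_PATTERN.sub("", term)


terms = ["red", "red _", "red _ yellow", "_ yellow", "yellow"]
cleaned = [kill_fillers_regex(t) for t in terms]
print(cleaned)
```

Note that the filler shingles are blanked rather than removed: each one becomes an empty token. If those empty tokens show up in the facet counts, one option (an assumption, not part of the original answer) is to follow kill_fillers with a "length" token filter with "min": 1 so that empty tokens are dropped.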