Using Shingles and Stop Words with Elasticsearch and Lucene 4.4


Question

In the index I'm building, I'm interested in running a query, then (using facets) returning the shingles of that query. Here's the analyzer I'm using on the text:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer"
          ]
        }
      },
      "filter": {
        "custom_stemmer" : {
            "type": "stemmer",
            "name": "english"
        },
        "custom_stop": {
            "type": "stop",
            "stopwords": "_english_"
        },
        "custom_shingle": {
            "type": "shingle",
            "min_shingle_size": "2",
            "max_shingle_size": "3"
        }
      }
    }
  }
}

The major issue is that, with Lucene 4.4, stop filters no longer support the enable_position_increments parameter, which could be used to eliminate shingles that contain stop words. Instead, for the text "red and yellow", I get results like:

"terms": [
    {
        "term": "red",
        "count": 43
    },
    {
        "term": "red _",
        "count": 43
    },
    {
        "term": "red _ yellow",
        "count": 43
    },
    {
        "term": "_ yellow",
        "count": 42
    },
    {
        "term": "yellow",
        "count": 42
    }
]

Naturally this GREATLY skews the number of shingles returned. Is there a way post-Lucene 4.4 to manage this without doing post-processing on the results?
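
To make the source of the "_" terms concrete, here is a rough Python sketch of what seems to be happening. It is a simplified model, not Lucene's actual ShingleFilter: the stop filter removes "and" but leaves a gap at its position, and the shingle filter fills that gap with "_". The names `STOP_WORDS`, `FILLER`, and `shingle_terms` are made up for illustration, and the one-word stop list is only a stand-in for `_english_`.

```python
STOP_WORDS = {"and"}  # tiny stand-in for the _english_ stop list (assumption)
FILLER = "_"          # filler the shingle filter puts where a token was removed


def shingle_terms(text, min_size=2, max_size=3):
    """Simplified model of stop filter + shingle filter (not Lucene's code)."""
    tokens = text.lower().split()
    # Stop filter stand-in: a removed word leaves an empty slot at its position.
    slots = [None if t in STOP_WORDS else t for t in tokens]
    terms = []
    for start, tok in enumerate(slots):
        if tok is not None:
            terms.append(tok)  # the unigram is emitted alongside the shingles
        for size in range(min_size, max_size + 1):
            window = slots[start:start + size]
            # Skip windows that run off the end or contain only empty slots.
            if len(window) < size or all(s is None for s in window):
                continue
            terms.append(" ".join(FILLER if s is None else s for s in window))
    return terms


print(shingle_terms("red and yellow"))
# ['red', 'red _', 'red _ yellow', '_ yellow', 'yellow']
```

Under these assumptions the sketch reproduces the five terms in the facet output above, three of which exist only because of the filler.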

Recommended answer

Probably not the most optimal solution, but the most blunt would be to add another filter to your analyzer to kill "_" filler tokens. In the example below I called it "kill_fillers":

"shingleAnalyzer": {
  "tokenizer": "standard",
  "filter": [
    "standard",
    "lowercase",
    "custom_stop",
    "custom_shingle",
    "custom_stemmer",
    "kill_fillers"
  ],
  ...

Add "kill_fillers" filter to your list of filters:

"filter": {
...
  "kill_fillers": {
    "type": "pattern_replace",
    "pattern": ".*_.*",
    "replacement": ""
  },
...
}
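
As a quick sanity check of the pattern, the following Python sketch applies the same regular expression to the terms from the question. It uses Python's `re` module as a stand-in for the Java regex engine behind `pattern_replace`, and the function name `apply_kill_fillers` is hypothetical. Note that, as far as I understand it, `pattern_replace` rewrites the token text rather than removing the token, so the affected shingles would become empty strings instead of disappearing.

```python
import re

# Same pattern as the "kill_fillers" filter: any term containing "_".
FILLER_PATTERN = re.compile(r".*_.*")


def apply_kill_fillers(terms):
    """Replace every term that contains '_' with an empty string."""
    return [FILLER_PATTERN.sub("", term) for term in terms]


terms = ["red", "red _", "red _ yellow", "_ yellow", "yellow"]
print(apply_kill_fillers(terms))
# ['red', '', '', '', 'yellow']
```

One caveat with such a blunt pattern: any legitimate term that happens to contain an underscore would be blanked as well.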
