Elasticsearch custom analyzer for hyphens, underscores, and numbers

Question

Admittedly, I'm not that well versed in the analysis part of ES. Here's the index layout:

{
    "mappings": {
        "event": {
            "properties": {
                "ipaddress": {
                    "type": "string"
                },
                "hostname": {
                    "type": "string",
                    "analyzer": "my_analyzer",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "my_filter": {
                    "type": "word_delimiter",
                    "preserve_original": true
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "my_filter"]
                }
            }
        }
    }
}

You can see that I've attempted to use a custom analyzer for the hostname field. This kind of works when I use this query to find the host named "WIN_1":

{
    "query": {
        "match": {
            "hostname": "WIN_1"
        }
    }
}

The issue is that it also returns any hostname that has a 1 in it. Using the _analyze endpoint, I can see that the numbers are tokenized as well.

{
    "tokens": [
        {
            "token": "win_1",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 1
        },
        {
            "token": "win",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "1",
            "start_offset": 4,
            "end_offset": 5,
            "type": "word",
            "position": 2
        }
    ]
}

What I'd like to be able to do is search for WIN and get back any host that has WIN in its name. But I also need to be able to search for WIN_1 and get back that exact host or any host with WIN_1 in its name. Below is some test data.

{
    "ipaddress": "192.168.1.253",
    "hostname": "WIN_8_ENT_1"
}
{
    "ipaddress": "10.0.0.1",
    "hostname": "server1"
}
{
    "ipaddress": "172.20.10.36",
    "hostname": "ServA-1"
}
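
To see why the match query returns every host, here is a rough Python model of my_analyzer (whitespace tokenizer, lowercase, then word_delimiter with preserve_original). It is only a sketch: the regex split on runs of ASCII letters and digits is an assumed simplification of what word_delimiter does, and the function name is made up for illustration.

```python
import re

def my_analyzer(text):
    """Approximate whitespace -> lowercase -> word_delimiter (preserve_original)."""
    tokens = []
    for word in text.split():
        word = word.lower()
        # Assumed simplification: split on delimiters and letter/digit boundaries.
        parts = re.findall(r"[a-z]+|[0-9]+", word)
        tokens.append(word)  # preserve_original keeps the unsplit token
        tokens.extend(p for p in parts if p != word)
    return tokens

hosts = ["WIN_8_ENT_1", "server1", "ServA-1"]
query_tokens = set(my_analyzer("WIN_1"))
for host in hosts:
    # A match query uses OR by default, so one shared token is enough to hit.
    shared = sorted(query_tokens & set(my_analyzer(host)))
    print(host, "shares", shared)
```

Under this model every test host shares at least the token "1" with the query, which is consistent with the unwanted hits described above.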

Hopefully someone can point me in the right direction. It could be that my simple query isn't the right approach either. I've pored over the ES docs, but they aren't very good with examples.

Recommended answer

Here's the analyzer and queries I ended up with:

{
    "mappings": {
        "event": {
            "properties": {
                "ipaddress": {
                    "type": "string"
                },
                "hostname": {
                    "type": "string",
                    "analyzer": "hostname_analyzer",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "hostname_filter": {
                    "type": "pattern_capture",
                    "preserve_original": false,
                    "patterns": [
                        "(\\p{Ll}{3,})"
                    ]
                }
            },
            "analyzer": {
                "hostname_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [  "lowercase", "hostname_filter" ]
                }
            }
        }
    }
}
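
As a rough illustration of what hostname_analyzer emits, here is a small Python model of the chain (whitespace tokenizer, lowercase, pattern_capture without the original token). It is a sketch under stated assumptions: `[a-z]{3,}` is an ASCII stand-in for `\p{Ll}{3,}`, and the function name is hypothetical.

```python
import re

# ASCII stand-in for the (\p{Ll}{3,}) capture pattern in hostname_filter.
CAPTURE = re.compile(r"([a-z]{3,})")

def hostname_analyzer(text):
    """Approximate whitespace -> lowercase -> pattern_capture."""
    tokens = []
    for word in text.split():
        word = word.lower()
        captured = CAPTURE.findall(word)
        # pattern_capture keeps the original token when no pattern matches.
        tokens.extend(captured if captured else [word])
    return tokens

for host in ["WIN_8_ENT_1", "server1", "ServA-1"]:
    print(host, "->", hostname_analyzer(host))
```

In this model digits and short fragments never become tokens of their own, so a search term such as "1" no longer matches every host.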

Queries: find hostnames starting with a given prefix:

{
    "query": {
        "prefix": {
            "hostname.raw": "WIN_8"
        }
    }
}

Find hostnames containing a given word:

{
    "query": {
        "multi_match": {
            "fields": [
                "hostname",
                "hostname.raw"
            ],
            "query": "WIN"
        }
    }
}
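
The behavior of the two queries can be sketched in Python against the test data from the question. This is a simplified model, not Elasticsearch itself: the helper names are made up, `[a-z]{3,}` again stands in for `\p{Ll}{3,}`, and scoring is ignored.

```python
import re

HOSTS = ["WIN_8_ENT_1", "server1", "ServA-1"]

def analyzed(text):
    """Simplified model of the tokens indexed for the hostname field."""
    out = []
    for word in text.lower().split():
        out.extend(re.findall(r"[a-z]{3,}", word) or [word])
    return out

def prefix_raw(prefix):
    """Model of a prefix query on hostname.raw (stored as-is, so case-sensitive)."""
    return [h for h in HOSTS if h.startswith(prefix)]

def multi_match(query):
    """Model of multi_match over hostname (analyzed) and hostname.raw (exact)."""
    terms = set(analyzed(query))
    return [h for h in HOSTS if terms & set(analyzed(h)) or h == query]

print(prefix_raw("WIN_8"))  # matches the stored value exactly as typed
print(prefix_raw("win_8"))  # different case, so nothing matches
print(multi_match("WIN"))   # hits via the analyzed "win" token
```

One consequence worth keeping in mind: because hostname.raw is not analyzed and a prefix query does not analyze its input, the prefix has to be typed in the same case as the stored hostname.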

Thanks to Dan for getting me in the right direction.
