Elasticsearch custom analyzer for hyphens, underscores, and numbers


Problem description

Admittedly, I'm not that well versed in the analysis part of ES. Here's the index layout:

{
    "mappings": {
        "event": {
            "properties": {
                "ipaddress": {
                    "type": "string"
                },
                "hostname": {
                    "type": "string",
                    "analyzer": "my_analyzer",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "my_filter": {
                    "type": "word_delimiter",
                    "preserve_original": true
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "my_filter"]
                }
            }
        }
    }
}

You can see that I've attempted to use a custom analyzer for the hostname field. This kind of works when I use this query to find the host named "WIN_1":

{
    "query": {
        "match": {
            "hostname": "WIN_1"
        }
    }
}

The issue is that it also returns any hostname that has a 1 in it. Using the _analyze endpoint, I can see that the numbers are tokenized as well.

{
    "tokens": [
        {
            "token": "win_1",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 1
        },
        {
            "token": "win",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "1",
            "start_offset": 4,
            "end_offset": 5,
            "type": "word",
            "position": 2
        }
    ]
}
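
To make the cause concrete, here is a rough Python sketch of what the analyzer above appears to do: whitespace tokenization, lowercasing, then a `word_delimiter`-style split that keeps the original token. The function `my_analyzer` is a hypothetical emulation written for illustration, not Elasticsearch code, and it only covers the letter/digit splitting relevant here. It assumes the `match` query uses its default OR operator, so a single shared term is enough for a hit.

```python
import re

def my_analyzer(text):
    """Rough emulation: whitespace tokenizer, lowercase, word_delimiter with preserve_original."""
    tokens = []
    for word in text.split():
        word = word.lower()
        tokens.append(word)  # preserve_original keeps the whole token
        # Split into runs of letters and runs of digits, dropping the delimiters
        parts = re.findall(r"[^\W\d_]+|\d+", word)
        tokens.extend(p for p in parts if p != word)
    return tokens

hostnames = ["WIN_8_ENT_1", "server1", "ServA-1"]
query_terms = set(my_analyzer("WIN_1"))

# With OR semantics, any term shared between query and document is a match
hits = [h for h in hostnames if query_terms & set(my_analyzer(h))]

print(my_analyzer("WIN_1"))
print(hits)
```

In this sketch all three test hostnames are hits, because each of them produces the token `1` and so does the query.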

What I'd like to be able to do is search for WIN and get back any host that has WIN in its name. But I also need to be able to search for WIN_1 and get back that exact host or any host with WIN_1 in its name. Below is some test data.

{
    "ipaddress": "192.168.1.253",
    "hostname": "WIN_8_ENT_1"
}
{
    "ipaddress": "10.0.0.1",
    "hostname": "server1"
}
{
    "ipaddress": "172.20.10.36",
    "hostname": "ServA-1"
}

Hopefully someone can point me in the right direction. It could be that my simple query isn't the right approach either. I've pored over the ES docs, but they aren't very good with examples.

Solution

You could change your analysis to use a pattern analyzer that discards the digits and underscores (and any other non-letter characters, such as hyphens):

{
    "analysis": {
        "analyzer": {
            "word_only": {
                "type": "pattern",
                "pattern": "([^\\p{L}]+)"
            }
        }
    }
}
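
JSON requires the backslash in `\p{L}` to be escaped, which is easy to get wrong by hand. One way around it, sketched here in Python, is to build the settings as a dictionary and let `json.dumps` do the escaping. The variable names are only illustrative, and sending the body to the cluster is left out.

```python
import json

# A raw string keeps the single backslash that the regex engine should see
settings = {
    "analysis": {
        "analyzer": {
            "word_only": {
                "type": "pattern",
                "pattern": r"([^\p{L}]+)",
            }
        }
    }
}

# json.dumps writes the backslash as a doubled backslash in the JSON text
body = json.dumps(settings, indent=4)
print(body)
```

Parsing `body` back with `json.loads` yields the original pattern with a single backslash.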

Using the analyze API:

curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=word_only&pretty=true' -d 'WIN_8_ENT_1'

returns:

"tokens" : [ {
    "token" : "win",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
}, {
    "token" : "ent",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "word",
    "position" : 2
} ]
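
The same splitting can be approximated in plain Python to see what the three test hostnames turn into. This is a sketch, not Elasticsearch code: `word_only` is a hypothetical helper that treats every run of non-letter characters as a separator and lowercases the result, which is how the pattern analyzer behaves with its default `lowercase` setting.

```python
from itertools import groupby

def word_only(text):
    """Keep runs of letters, lowercased; everything else acts as a separator."""
    return [
        "".join(run).lower()
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]

for hostname in ["WIN_8_ENT_1", "server1", "ServA-1"]:
    print(hostname, word_only(hostname))
```

Note the consequence: digits never reach the index of the analyzed field, so a query such as `WIN_1` is reduced to the single term `win` there.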

Your mapping would become:

{
    "event": {
        "properties": {
            "ipaddress": {
                "type": "string"
            },
            "hostname": {
                "type": "string",
                "analyzer": "word_only",
                "fields": {
                    "raw": {
                        "type": "string",
                        "index": "not_analyzed"
                    }
                }
            }
        }
    }
}

You can use a multi_match query to get the results you want:

{
    "query": {
        "multi_match": {
            "fields": [
                "hostname",
                "hostname.raw"
            ],
            "query": "WIN_1"
        }
    }
}
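
As a rough illustration of how the two fields complement each other, the Python sketch below reports which field would match for each hostname. It is an emulation under stated assumptions: `word_only` and `matched_fields` are hypothetical helpers, `hostname.raw` is treated as an exact, case-sensitive comparison because it is `not_analyzed`, the host `WIN_1` is added to the test data purely for illustration, and scoring is not modeled.

```python
from itertools import groupby

def word_only(text):
    """Keep runs of letters, lowercased; everything else acts as a separator."""
    return [
        "".join(run).lower()
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]

def matched_fields(hostname, query):
    """Return the fields in which this document would match the query."""
    fields = []
    # Analyzed field: any shared letter-only term counts as a match
    if set(word_only(query)) & set(word_only(hostname)):
        fields.append("hostname")
    # not_analyzed field: the whole value must equal the query exactly
    if hostname == query:
        fields.append("hostname.raw")
    return fields

# "WIN_1" is a hypothetical extra host, not part of the original test data
docs = ["WIN_8_ENT_1", "server1", "ServA-1", "WIN_1"]
for doc in docs:
    print(doc, matched_fields(doc, "WIN_1"))
```

In this sketch `server1` and `ServA-1` no longer match a search for `WIN_1`, `WIN_8_ENT_1` matches through the analyzed field only, and the exact host matches in both fields.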
