ElasticSearch Analyzer and Tokenizer for Emails


Question

I could not find a perfect solution on Google or in the ES documentation for the following situation, and I hope someone can help here.

Suppose there are five documents with email addresses stored under the field "email":

1. {"email": "john.doe@gmail.com"}
2. {"email": "john.doe@gmail.com, john.doe@outlook.com"}
3. {"email": "hello-john.doe@outlook.com"}
4. {"email": "john.doe@outlook.com"}
5. {"email": "john@yahoo.com"}

I want to fulfill the following searching scenarios:

[Search -> Receive]

"john.doe@gmail.com" -> 1,2

"john.doe@outlook.com" -> 2,4

"john@yahoo.com" -> 5

"john.doe" -> 1,2,3,4

"john" -> 1,2,3,4,5

"gmail.com" -> 1,2

"outlook.com" -> 2,3,4

The first three matches are a MUST; for the rest, the more precise the better. I have already tried different combinations of index/search analyzers, tokenizers, and filters, and I have also tried adjusting the conditions of match queries, but did not find an ideal solution. Any thoughts are welcome, and there is no restriction on the mappings, analyzers, or kind of query to use. Thanks.

Solution

Mapping:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)",
            "([^-@]+)"
          ]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": [
            "email",
            "lowercase",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email"
        }
      }
    }
  }
}
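
To see why this mapping helps, it can be useful to look at which terms the "email" filter would emit for a single address. The following is only a rough Python sketch of the filter chain (pattern_capture, then lowercase, then unique), not the actual Lucene implementation. It assumes that every capture group of every match is emitted, it substitutes `[^\W\d_]` for `\p{L}` because Python's `re` module does not support `\p{L}`, and the helper name `email_tokens` is made up for illustration. The order of the real token stream may differ.

```python
import re

# Patterns copied from the "email" filter in the mapping above.
# Python's re module has no \p{L}, so [^\W\d_] (any Unicode letter)
# stands in for it here; this substitution is an assumption.
PATTERNS = [r"([^@]+)", r"([^\W\d_]+)", r"(\d+)", r"@(.+)", r"([^-@]+)"]


def email_tokens(token):
    """Approximate pattern_capture + lowercase + unique for one token."""
    emitted = [token]  # preserve_original keeps the full token
    for pattern in PATTERNS:
        for match in re.finditer(pattern, token):
            # Emit every non-empty capture group of every match.
            emitted.extend(group for group in match.groups() if group)
    seen, result = set(), []
    for term in (t.lower() for t in emitted):  # lowercase, then unique
        if term not in seen:
            seen.add(term)
            result.append(term)
    return result


print(email_tokens("hello-john.doe@outlook.com"))
```

Under these assumptions the address is indexed not only as a whole but also as its local part, its domain, and smaller pieces such as "john.doe" and "john", which is what lets a partial search term hit the document.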

Test data:

POST /test/emails/_bulk
{"index":{"_id":"1"}}
{"email": "john.doe@gmail.com"}
{"index":{"_id":"2"}}
{"email": "john.doe@gmail.com, john.doe@outlook.com"}
{"index":{"_id":"3"}}
{"email": "hello-john.doe@outlook.com"}
{"index":{"_id":"4"}}
{"email": "john.doe@outlook.com"}
{"index":{"_id":"5"}}
{"email": "john@yahoo.com"}

Query to be used:

GET /test/emails/_search
{
  "query": {
    "term": {
      "email": "john.doe@gmail.com"
    }
  }
}
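
A term query is not analyzed, so it matches only if the search string equals one of the indexed terms exactly. As a sanity check, the Python sketch below simulates that lookup for the seven searches from the question. It is an approximation under stated assumptions: splitting on commas and whitespace stands in for the uax_url_email tokenizer, `[^\W\d_]` stands in for `\p{L}`, and the names `indexed_terms` and `term_search` are hypothetical helpers, not Elasticsearch APIs.

```python
import re

# Patterns from the "email" filter; [^\W\d_] approximates \p{L}.
PATTERNS = [r"([^@]+)", r"([^\W\d_]+)", r"(\d+)", r"@(.+)", r"([^-@]+)"]

# The five test documents from the bulk request above.
DOCS = {
    1: "john.doe@gmail.com",
    2: "john.doe@gmail.com, john.doe@outlook.com",
    3: "hello-john.doe@outlook.com",
    4: "john.doe@outlook.com",
    5: "john@yahoo.com",
}


def indexed_terms(field_value):
    """Approximate the set of terms the 'email' analyzer would index."""
    terms = set()
    # Stand-in for the uax_url_email tokenizer: split on commas/whitespace.
    for token in re.split(r"[,\s]+", field_value.strip()):
        if not token:
            continue
        terms.add(token.lower())  # preserve_original
        for pattern in PATTERNS:
            for match in re.finditer(pattern, token):
                terms.update(g.lower() for g in match.groups() if g)
    return terms


def term_search(term):
    """Return ids of documents whose indexed terms contain `term` exactly."""
    return sorted(i for i, value in DOCS.items() if term in indexed_terms(value))


for query in ["john.doe@gmail.com", "john.doe@outlook.com", "john@yahoo.com",
              "john.doe", "john", "gmail.com", "outlook.com"]:
    print(query, "->", term_search(query))
```

In this simulation all seven searches return the document ids listed in the question. Note that the search string must already be lowercase, since the term query bypasses the analyzer and the indexed terms are lowercased.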
