ElasticSearch Analyzer and Tokenizer for Emails
I could not find a perfect solution for the following situation on Google or in ES, and I hope someone can help here.
Suppose there are five email addresses stored in the field "email":
1. {"email": "john.doe@gmail.com"}
2. {"email": "john.doe@gmail.com, john.doe@outlook.com"}
3. {"email": "hello-john.doe@outlook.com"}
4. {"email": "john.doe@outlook.com"}
5. {"email": "john@yahoo.com"}
I want to fulfill the following searching scenarios:
[Search -> Receive]
"john.doe@gmail.com" -> 1,2
"john.doe@outlook.com" -> 2,4
"john@yahoo.com" -> 5
"john.doe" -> 1,2,3,4
"john" -> 1,2,3,4,5
"gmail.com" -> 1,2
"outlook.com" -> 2,3,4
The first three matches are a MUST; for the rest, the more precise the better. I have already tried different combinations of index/search analyzers, tokenizers, and filters, and also tried adjusting the conditions of match queries, but did not find an ideal solution. Any thoughts are welcome, and there is no restriction on the mappings, analyzers, or kind of query to use. Thanks.
Mapping:
PUT /test
{
"settings": {
"analysis": {
"filter": {
"email": {
"type": "pattern_capture",
"preserve_original": 1,
"patterns": [
"([^@]+)",
"(\\p{L}+)",
"(\\d+)",
"@(.+)",
"([^-@]+)"
]
}
},
"analyzer": {
"email": {
"tokenizer": "uax_url_email",
"filter": [
"email",
"lowercase",
"unique"
]
}
}
}
},
"mappings": {
"emails": {
"properties": {
"email": {
"type": "string",
"analyzer": "email"
}
}
}
}
}
Test data:
POST /test/emails/_bulk
{"index":{"_id":"1"}}
{"email": "john.doe@gmail.com"}
{"index":{"_id":"2"}}
{"email": "john.doe@gmail.com, john.doe@outlook.com"}
{"index":{"_id":"3"}}
{"email": "hello-john.doe@outlook.com"}
{"index":{"_id":"4"}}
{"email": "john.doe@outlook.com"}
{"index":{"_id":"5"}}
{"email": "john@yahoo.com"}
Query to be used:
GET /test/emails/_search
{
"query": {
"term": {
"email": "john.doe@gmail.com"
}
}
}
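To make it easier to reason about which tokens the "email" analyzer above might emit, here is a rough Python simulation. It is only a sketch under several assumptions, not Elasticsearch itself: the uax_url_email tokenizer is approximated by splitting on commas and whitespace, `\p{L}` is replaced by `[^\W\d_]` because Python's `re` module has no `\p{L}`, and the helper names `analyze` and `term_search` are made up for illustration. A term query is modelled as an exact lookup in each document's token set, since term queries do not analyze the search text.

```python
import re

# Capture patterns copied from the "email" filter in the mapping.
# Assumption: [^\W\d_] stands in for \p{L}, which Python's re lacks.
PATTERNS = [
    r"([^@]+)",
    r"([^\W\d_]+)",
    r"(\d+)",
    r"@(.+)",
    r"([^-@]+)",
]

# The five test documents from the bulk request.
DOCS = {
    1: "john.doe@gmail.com",
    2: "john.doe@gmail.com, john.doe@outlook.com",
    3: "hello-john.doe@outlook.com",
    4: "john.doe@outlook.com",
    5: "john@yahoo.com",
}


def analyze(text):
    """Approximate the analyzer: tokenize, capture groups, lowercase, unique."""
    tokens = set()
    # Assumption: splitting on commas/whitespace mimics uax_url_email
    # closely enough for this sample data.
    for address in re.split(r"[,\s]+", text.strip()):
        if not address:
            continue
        tokens.add(address.lower())  # preserve_original
        for pattern in PATTERNS:
            for match in re.finditer(pattern, address):
                tokens.update(g.lower() for g in match.groups() if g)
    return tokens


# Hypothetical stand-in for the inverted index: doc id -> token set.
INDEX = {doc_id: analyze(text) for doc_id, text in DOCS.items()}


def term_search(term):
    """Model a term query: exact match against the indexed tokens."""
    return sorted(doc_id for doc_id, tokens in INDEX.items() if term in tokens)


if __name__ == "__main__":
    for term in ["john.doe@gmail.com", "john.doe@outlook.com", "john@yahoo.com",
                 "john.doe", "john", "gmail.com", "outlook.com"]:
        print(term, "->", term_search(term))
```

In this simulation the seven searches return the document IDs listed in the [Search -> Receive] table. Whether a real cluster behaves the same way depends on the Elasticsearch version and on the query type used, so the actual token output should be checked against the index itself.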