How to fuzzy match email or telephone by Elasticsearch?


Problem description


I want to do fuzzy matching on email addresses or telephone numbers with Elasticsearch. For example:

match all emails ending with @gmail.com

or

match all telephone numbers starting with 136.

I know I can use a wildcard query,

{
 "query": {
    "wildcard" : {
      "email": "*gmail.com"
    }
  }
}

but the performance is very poor. I tried to use regexp:

{"query": {"regexp": {"email": {"value": "*163\.com*"} } } }

But it doesn't work.

Is there a better way to do this?

curl -XGET localhost:9200/user_data

{
    "user_data": {
        "aliases": {},
        "mappings": {
            "user_data": {
                "properties": {
                    "address": {
                        "type": "string"
                    },
                    "age": {
                        "type": "long"
                    },
                    "comment": {
                        "type": "string"
                    },
                    "created_on": {
                        "type": "date",
                        "format": "dateOptionalTime"
                    },
                    "custom": {
                        "properties": {
                            "key": {
                                "type": "string"
                            },
                            "value": {
                                "type": "string"
                            }
                        }
                    },
                    "gender": {
                        "type": "string"
                    },
                    "name": {
                        "type": "string"
                    },
                    "qq": {
                        "type": "string"
                    },
                    "tel": {
                        "type": "string"
                    },
                    "updated_on": {
                        "type": "date",
                        "format": "dateOptionalTime"
                    }
                }
            }
        },
        "settings": {
            "index": {
                "creation_date": "1458832279465",
                "uuid": "Fbmthc3lR0ya51zCnWidYg",
                "number_of_replicas": "1",
                "number_of_shards": "5",
                "version": {
                    "created": "1070299"
                }
            }
        },
        "warmers": {}
    }
}

The mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "digit_edge_ngram_tokenizer",
          "filter": [ "trim" ]
        },
        "search_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        },
        "index_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "name_ngram_filter", "trim" ]
        },
        "search_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "trim" ]
        }
      },
      "char_filter": {
        "digit_only": {
          "type": "pattern_replace",
          "pattern": "\\D+",
          "replacement": ""
        }
      },
      "tokenizer": {
        "digit_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "15",
          "token_chars": [ "digit" ]
        }
      },
      "filter": {
        "name_ngram_filter": {
          "type": "ngram",
          "min_gram": "3",
          "max_gram": "20"
        }
      }
    }
  },
  "mappings" : {
    "user_data" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "analyzer" : "ik"
        },
        "age" : {
          "type" : "integer"
        },
        "gender": {
          "type" : "string"
        },
        "qq" : {
          "type" : "string"
        },
        "email" : {
          "type" : "string",
          "analyzer": "index_email_analyzer",
          "search_analyzer": "search_email_analyzer"
        },
        "tel" : {
          "type" : "string",
          "analyzer": "index_phone_analyzer",
          "search_analyzer": "search_phone_analyzer"
        },
        "address" : {
          "type": "string",
          "analyzer" : "ik"
        },
        "comment" : {
          "type" : "string",
          "analyzer" : "ik"
        },
        "created_on" : {
          "type" : "date",
          "format" : "dateOptionalTime"
        },
        "updated_on" : {
          "type" : "date",
          "format" : "dateOptionalTime"
        },
        "custom": {
          "type" : "nested",
          "properties" : {
            "key" : {
              "type" : "string"
            },
            "value" : {
              "type" : "string"
            }
          }
        }
      }
    }
  }
}

Solution

An easy way to do this is to create custom analyzers: an n-gram token filter for emails (see index_email_analyzer and search_email_analyzer below, plus email_url_analyzer for exact email matching) and an edge-ngram tokenizer for phones (see index_phone_analyzer and search_phone_analyzer below).

The full index definition is available below.

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_url_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [ "trim" ]
        },
        "index_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "digit_edge_ngram_tokenizer",
          "filter": [ "trim" ]
        },
        "search_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        },
        "index_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "name_ngram_filter", "trim" ]
        },
        "search_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "trim" ]
        }
      },
      "char_filter": {
        "digit_only": {
          "type": "pattern_replace",
          "pattern": "\\D+",
          "replacement": ""
        }
      },
      "tokenizer": {
        "digit_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "1",
          "max_gram": "15",
          "token_chars": [ "digit" ]
        }
      },
      "filter": {
        "name_ngram_filter": {
          "type": "ngram",
          "min_gram": "1",
          "max_gram": "20"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "index_email_analyzer",
          "search_analyzer": "search_email_analyzer"
        },
        "phone": {
          "type": "string",
          "analyzer": "index_phone_analyzer",
          "search_analyzer": "search_phone_analyzer"
        }
      }
    }
  }
}

Now, let's go through it piece by piece.

For the phone field, the idea is to index phone values with index_phone_analyzer, which uses an edge-ngram tokenizer to index every prefix of the phone number. So if your phone number is 1362435647, the following tokens will be produced: 1, 13, 136, 1362, 13624, 136243, 1362435, 13624356, 136243564, 1362435647.

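Then when searching we use another analyzer, search_phone_analyzer, which strips non-digit characters and keeps the input number (e.g. 136) as a single token to match against the phone field. A match query applies this analyzer; a term query skips analysis, so it works only when the input is already a plain digit string:

POST myindex/_search
{
    "query": {
        "term":
            { "phone": "136" }
    }
}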
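
To make the prefix idea concrete, here is a minimal Python sketch that imitates the two phone analyzers outside of Elasticsearch. It is only an approximation for illustration: the function names (index_phone_tokens, search_phone_token, phone_matches) are made up here, and the real analysis happens inside Lucene.

```python
import re


def index_phone_tokens(raw, min_gram=1, max_gram=15):
    """Imitate index_phone_analyzer: drop non-digits, then emit every prefix."""
    digits = re.sub(r"\D+", "", raw)  # same pattern as the digit_only char filter
    longest = min(len(digits), max_gram)
    return [digits[:n] for n in range(min_gram, longest + 1)]


def search_phone_token(raw):
    """Imitate search_phone_analyzer: drop non-digits, keep one single token."""
    return re.sub(r"\D+", "", raw).strip()


def phone_matches(stored, query):
    """A stored number matches when the query token is one of its indexed prefixes."""
    return search_phone_token(query) in index_phone_tokens(stored)


if __name__ == "__main__":
    print(index_phone_tokens("136-2435-647"))
    print(phone_matches("136-2435-647", "136"))   # query is a prefix of the number
    print(phone_matches("213-6243-5647", "136"))  # 136 occurs, but not at the start
```

Because only prefixes are indexed, a number that merely contains 136 somewhere in the middle is not returned, which is the "starts with" behaviour asked for in the question.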

For the email field we proceed in a similar way: email values are indexed with index_email_analyzer, whose ngram token filter emits every substring between 1 and 20 characters long of each token. Note that the standard tokenizer first splits the address at the @ sign, so john@gmail.com becomes john and gmail.com, and the indexed tokens are j, jo, joh, john, ..., g, gm, ..., gmail.com; no token contains the @ itself.

Then when searching, search_email_analyzer analyzes the input the same way but without the ngram filter, and the result is matched against the indexed tokens. This requires a match query, because a term query skips analysis and would look for the literal token @gmail.com, which was never indexed:

POST myindex/_search
{
    "query": {
        "match":
            { "email": "@gmail.com" }
    }
}
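
The email side can be imitated the same way. The Python sketch below is a rough stand-in, not the real analyzers: split_words only approximates the standard tokenizer (it assumes the address is split at the @ sign and that dots inside a word are kept), and all function names are hypothetical.

```python
import re


def ngrams(token, min_gram=1, max_gram=20):
    """Every substring of token whose length is between min_gram and max_gram."""
    out = set()
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break
            out.add(token[start:start + size])
    return out


def split_words(text):
    """Rough stand-in for the standard tokenizer (an assumption, not its real rules)."""
    parts = re.split(r"[^A-Za-z0-9.]+", text)
    return [p.strip(".") for p in parts if p.strip(".")]


def index_email_tokens(email):
    """Imitate index_email_analyzer: split, lowercase, then n-gram each word."""
    tokens = set()
    for word in split_words(email):
        tokens |= ngrams(word.lower())
    return tokens


def search_email_tokens(query):
    """Imitate search_email_analyzer: split and lowercase, no n-grams."""
    return [word.lower() for word in split_words(query)]


def email_matches(stored, query):
    """Like a match query with the default OR operator: any query token may hit."""
    indexed = index_email_tokens(stored)
    return any(token in indexed for token in search_email_tokens(query))


if __name__ == "__main__":
    print(email_matches("John@Gmail.com", "@gmail.com"))
    print(email_matches("bob@yahoo.com", "@gmail.com"))
```

Under these assumptions the @ never reaches the index, so the query effectively asks for addresses with a word containing gmail.com rather than a strict "ends with" test.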

The email_url_analyzer analyzer is not used in this example but I've included it just in case you need to match on the exact email value.
