How to fuzzy match email or telephone by Elasticsearch?

Views: 278 | Published: 2017/8/6 22:35:54 | Tags: mysql, elasticsearch


Problem description

I want to do fuzzy matching on email addresses or telephone numbers with Elasticsearch. For example:

match all emails that end with @gmail.com

or

match all telephone numbers that start with 136.

I know I can use a wildcard query,

{
  "query": {
    "wildcard": {
      "email": "*gmail.com"
    }
  }
}

but the performance is very poor. I tried to use a regexp query:

{"query": {"regexp": {"email": {"value": "*163\.com*"} } } }

But it doesn't work.

Is there a better way to do this?

curl -XGET localhost:9200/user_data

{
    "user_data": {
        "aliases": {},
        "mappings": {
            "user_data": {
                "properties": {
                    "address": {
                        "type": "string"
                    },
                    "age": {
                        "type": "long"
                    },
                    "comment": {
                        "type": "string"
                    },
                    "created_on": {
                        "type": "date",
                        "format": "dateOptionalTime"
                    },
                    "custom": {
                        "properties": {
                            "key": {
                                "type": "string"
                            },
                            "value": {
                                "type": "string"
                            }
                        }
                    },
                    "gender": {
                        "type": "string"
                    },
                    "name": {
                        "type": "string"
                    },
                    "qq": {
                        "type": "string"
                    },
                    "tel": {
                        "type": "string"
                    },
                    "updated_on": {
                        "type": "date",
                        "format": "dateOptionalTime"
                    }
                }
            }
        },
        "settings": {
            "index": {
                "creation_date": "1458832279465",
                "uuid": "Fbmthc3lR0ya51zCnWidYg",
                "number_of_replicas": "1",
                "number_of_shards": "5",
                "version": {
                    "created": "1070299"
                }
            }
        },
        "warmers": {}
    }
}

The mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "digit_edge_ngram_tokenizer",
          "filter": [ "trim" ]
        },
        "search_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        },
        "index_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "name_ngram_filter", "trim" ]
        },
        "search_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "trim" ]
        }
      },
      "char_filter": {
        "digit_only": {
          "type": "pattern_replace",
          "pattern": "\\D+",
          "replacement": ""
        }
      },
      "tokenizer": {
        "digit_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "15",
          "token_chars": [ "digit" ]
        }
      },
      "filter": {
        "name_ngram_filter": {
          "type": "ngram",
          "min_gram": "3",
          "max_gram": "20"
        }
      }
    }
  },
  "mappings" : {
    "user_data" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "analyzer" : "ik"
        },
        "age" : {
          "type" : "integer"
        },
        "gender": {
          "type" : "string"
        },
        "qq" : {
          "type" : "string"
        },
        "email" : {
          "type" : "string",
          "analyzer": "index_email_analyzer",
          "search_analyzer": "search_email_analyzer"
        },
        "tel" : {
          "type" : "string",
          "analyzer": "index_phone_analyzer",
          "search_analyzer": "search_phone_analyzer"
        },
        "address" : {
          "type": "string",
          "analyzer" : "ik"
        },
        "comment" : {
          "type" : "string",
          "analyzer" : "ik"
        },
        "created_on" : {
          "type" : "date",
          "format" : "dateOptionalTime"
        },
        "updated_on" : {
          "type" : "date",
          "format" : "dateOptionalTime"
        },
        "custom": {
          "type" : "nested",
          "properties" : {
            "key" : {
              "type" : "string"
            },
            "value" : {
              "type" : "string"
            }
          }
        }
      }
    }
  }
}

Solution

An easy way to do this is to create custom analyzers: one pair that uses the n-gram token filter for emails (see index_email_analyzer and search_email_analyzer below, plus email_url_analyzer for exact email matching) and one pair that uses an edge-ngram tokenizer for phones (see index_phone_analyzer and search_phone_analyzer below).

The full index definition is available below.

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_url_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [ "trim" ]
        },
        "index_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "digit_edge_ngram_tokenizer",
          "filter": [ "trim" ]
        },
        "search_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        },
        "index_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "name_ngram_filter", "trim" ]
        },
        "search_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "trim" ]
        }
      },
      "char_filter": {
        "digit_only": {
          "type": "pattern_replace",
          "pattern": "\\D+",
          "replacement": ""
        }
      },
      "tokenizer": {
        "digit_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "1",
          "max_gram": "15",
          "token_chars": [ "digit" ]
        }
      },
      "filter": {
        "name_ngram_filter": {
          "type": "ngram",
          "min_gram": "1",
          "max_gram": "20"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "index_email_analyzer",
          "search_analyzer": "search_email_analyzer"
        },
        "phone": {
          "type": "string",
          "analyzer": "index_phone_analyzer",
          "search_analyzer": "search_phone_analyzer"
        }
      }
    }
  }
}

Now, let's go through it piece by piece.

For the phone field, the idea is to index phone values with index_phone_analyzer, which uses an edge-ngram tokenizer in order to index all prefixes of the phone number. So if your phone number is 1362435647, the following tokens will be produced: 1, 13, 136, 1362, 13624, 136243, 1362435, 13624356, 136243564, 1362435647.
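To make the prefix expansion concrete, here is a minimal plain-Python sketch of what the index-time phone analysis roughly does. It does not call Elasticsearch; the function name index_phone_tokens is made up for illustration, and the defaults simply mirror the min_gram and max_gram values from the index definition above.

```python
import re


def index_phone_tokens(value, min_gram=1, max_gram=15):
    """Approximate index_phone_analyzer (illustrative helper, not an ES API)."""
    # digit_only char filter: remove every run of non-digit characters
    digits = re.sub(r"\D+", "", value)
    # edge n-gram tokenizer: emit prefixes from min_gram up to max_gram chars
    longest = min(len(digits), max_gram)
    return [digits[:n] for n in range(min_gram, longest + 1)]


tokens = index_phone_tokens("1362435647")
print(tokens)
```

Because every prefix is stored as its own token, a "starts with 136" search becomes an exact lookup of the token 136 instead of a wildcard scan.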

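Then when searching, we use another analyzer, search_phone_analyzer, which simply takes the input number (e.g. 136), strips any non-digit characters, and keeps it as a single token to be matched against the phone field. A match query applies this search analyzer to the input; a term query skips analysis and looks the value up as-is, which also works here as long as the input is already a clean digit string:

POST myindex/_search
{
  "query": {
    "term": { "phone": "136" }
  }
}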
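The search side can be sketched the same way. The snippet below is a toy in-memory model, not Elasticsearch code: build_prefix_index and search_prefix are hypothetical helpers, and the sample phone numbers are invented. It assumes the input goes through the search analyzer, as it would with a match query.

```python
import re
from collections import defaultdict


def digits_only(value):
    # Same idea as the digit_only char filter: drop non-digit characters
    return re.sub(r"\D+", "", value)


def build_prefix_index(phones, min_gram=1, max_gram=15):
    """Map every indexed prefix token to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, phone in phones.items():
        digits = digits_only(phone)
        for n in range(min_gram, min(len(digits), max_gram) + 1):
            index[digits[:n]].add(doc_id)
    return index


def search_prefix(index, user_input):
    # Search-time analysis: clean the input, keep it as ONE token, exact lookup
    return sorted(index.get(digits_only(user_input), set()))


phones = {1: "1362435647", 2: "136-555-0100", 3: "1871362000"}  # invented data
index = build_prefix_index(phones)
print(search_prefix(index, "136"))
```

Document 3 contains the digits 136 but does not start with them, so it is not returned: edge n-grams give prefix matching only, which is exactly what the "starts with 136" requirement asks for.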

For the email field, we proceed in a similar way: email values are indexed with index_email_analyzer, which uses an ngram token filter to produce all possible substrings of varying length (between 1 and 20 characters) of each token. Note that the standard tokenizer splits the value at the @ sign first, so john@gmail.com becomes the two tokens john and gmail.com, which are then expanded to j, jo, joh, john, o, oh, ..., g, gm, ..., gmail.com.
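The email analysis can be approximated in plain Python as well. This is only a rough sketch: standard_like_tokens is a simplified stand-in for the standard tokenizer (it assumes the tokenizer breaks at @ but keeps dotted runs such as gmail.com together), and all function names are invented for illustration.

```python
import re


def standard_like_tokens(text):
    # Simplified stand-in for the standard tokenizer: runs of letters/digits,
    # optionally joined by single dots; "@" and other symbols act as breaks.
    return re.findall(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*", text)


def ngrams(token, min_gram=1, max_gram=20):
    """All substrings of token whose length is between min_gram and max_gram."""
    out = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break
            out.append(token[start:start + size])
    return out


def index_email_tokens(email):
    # Index side: tokenize, lowercase, then n-gram every token
    tokens = []
    for tok in standard_like_tokens(email.lower()):
        tokens.extend(ngrams(tok))
    return tokens


def search_email_tokens(text):
    # Search side: tokenize and lowercase only, no n-grams
    return standard_like_tokens(text.lower())


indexed = set(index_email_tokens("John@Gmail.com"))
query = search_email_tokens("@gmail.com")
print(query)
print(all(t in indexed for t in query))
```

Under these assumptions the search input @gmail.com is reduced to the single token gmail.com, which is one of the indexed n-grams, so the document matches. The same mechanism also matches any address whose domain merely contains gmail.com, which is the price of using full n-grams rather than edge n-grams.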

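Then when searching, we use another analyzer called search_email_analyzer, which tokenizes and lowercases the input without n-gramming it and matches the result against the indexed tokens. Because the standard tokenizer drops the @ sign, this needs a match query (which applies the search analyzer) rather than a term query (which would look up the literal string @gmail.com and find nothing):

POST myindex/_search
{
  "query": {
    "match": { "email": "@gmail.com" }
  }
}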

The email_url_analyzer is not used in this example, but I've included it in case you need to match on the exact email value.
