Elasticsearch Query aggregated by unique substrings (email domain)
Question
I have an Elasticsearch query that searches an index and then aggregates on a specific field, sender_not_analyzed. I use a terms aggregation on that same field, which returns buckets for the top "senders". My query is currently:
{
"size": 0,
"query": {
"regexp": {
"sender_not_analyzed": ".*[@].*"
}
},
"aggs": {
"sender-stats": {
"terms": {
"field": "sender_not_analyzed"
}
}
}
}
which returns buckets that look like:
"aggregations": {
"sender-stats": {
"buckets": [
{
"key": "<Mike <mike@fizzbuzz.com>@MISSING_DOMAIN>",
"doc_count": 5017
},
{
"key": "jon.doe@foo.com",
"doc_count": 3963
},
{
"key": "jane.doe@foo.com",
"doc_count": 2857
},
      {
        "key": "jon.doe@bar.com",
        "doc_count": 1544
      }
    ]
  }
}
How can I write an aggregation such that I get a single bucket for each unique email domain, e.g. foo.com
would have a doc_count
of (3963 + 2857) 6820? Can I accomplish this with a regex aggregation, or do I need to write some kind of custom analyzer to split the string from the @ to the end of the string?
This is pretty late, but I think this can be done by using a pattern_replace char filter: you capture the domain name with a regex. This is my setup:
POST email_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"char_filter": [
"domain"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"char_filter": {
"domain": {
"type": "pattern_replace",
"pattern": ".*@(.*)",
"replacement": "$1"
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"domain": {
"type": "string",
"analyzer": "my_custom_analyzer"
},
"sender_not_analyzed": {
"type": "string",
"index": "not_analyzed",
"copy_to": "domain"
}
}
}
}
}
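To sanity-check the analyzer, you can run it through the _analyze API (the JSON body form shown here is the one current versions expect; older versions also accept ?analyzer=...&text=... as query parameters — the sender value below is just an example input):

GET email_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Jon.Doe@Foo.com"
}

This should return a single token, foo.com: the char filter strips everything up to and including the @, the keyword tokenizer emits the remainder as one token, and the lowercase filter lowercases it.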
Here the domain char filter
captures the domain name. We need the keyword tokenizer so the domain is emitted as a single token, exactly as the char filter produced it. I am using the lowercase
filter, but it is up to you whether to use it or not. The copy_to parameter copies the value of sender_not_analyzed
into the domain
field; although the _source
field won't be modified to include this value, we can still query and aggregate on it.
GET email_index/_search
{
"size": 0,
"query": {
"regexp": {
"sender_not_analyzed": ".*[@].*"
}
},
"aggs": {
"sender-stats": {
"terms": {
"field": "domain"
}
}
}
}
This will give you the desired result.
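With the four senders from the question, the buckets should come back keyed by domain, roughly like this (response trimmed; note that because .* in the pattern is greedy, it anchors on the last @, so the malformed first sender would land in its own bucket under something like missing_domain> rather than fizzbuzz.com):

"aggregations": {
  "sender-stats": {
    "buckets": [
      { "key": "foo.com", "doc_count": 6820 },
      { "key": "missing_domain>", "doc_count": 5017 },
      { "key": "bar.com", "doc_count": 1544 }
    ]
  }
}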