Elasticsearch Query aggregated by unique substrings (email domain)

Views: 147 Published: 2017/8/7 2:20:59 Tags: elasticsearch aggregation

Problem description

I have an Elasticsearch query that searches an index and then aggregates on a specific field, sender_not_analyzed. The terms aggregation on that field returns buckets for the top "senders". My query is currently:

{
   "size": 0,
   "query": {
      "regexp": {
         "sender_not_analyzed": ".*[@].*"
      }
   },
   "aggs": {
      "sender-stats": {
         "terms": {
            "field": "sender_not_analyzed"
         }
      }
   }
}

It returns buckets that look like this:

"aggregations": {
      "sender-stats": {
         "buckets": [
            {
               "key": "<Mike <mike@fizzbuzz.com>@MISSING_DOMAIN>",
               "doc_count": 5017
            },
            {
               "key": "jon.doe@foo.com",
               "doc_count": 3963
            },
            {
               "key": "jane.doe@foo.com",
               "doc_count": 2857
            },
            {
               "key": "jon.doe@bar.com",
               "doc_count": 1544
            }

How can I write an aggregation that returns a single bucket for each unique email domain? For example, foo.com would have a doc_count of 6820 (3963 + 2857). Can I accomplish this with a regex aggregation, or do I need to write a custom analyzer that extracts the part of the string from the @ to the end?
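To make the desired result concrete, here is a minimal Python sketch that merges the per-sender buckets shown above into per-domain totals on the client side. The helper names `domain_of` and `merge_by_domain` are hypothetical and not part of any Elasticsearch client. This only illustrates the arithmetic being asked for; it is not a server-side aggregation.

```python
from collections import Counter


def domain_of(sender: str) -> str:
    """Return everything after the last '@', lowercased."""
    return sender.rsplit("@", 1)[-1].lower()


def merge_by_domain(buckets):
    """Sum doc_count over all buckets that share the same email domain."""
    totals = Counter()
    for bucket in buckets:
        totals[domain_of(bucket["key"])] += bucket["doc_count"]
    return dict(totals)


# Three of the buckets from the response in the question.
buckets = [
    {"key": "jon.doe@foo.com", "doc_count": 3963},
    {"key": "jane.doe@foo.com", "doc_count": 2857},
    {"key": "jon.doe@bar.com", "doc_count": 1544},
]

domain_counts = merge_by_domain(buckets)
print(domain_counts)  # {'foo.com': 6820, 'bar.com': 1544}
```

Because a terms aggregation returns only the top buckets, merging on the client can undercount domains whose senders fall outside that list, which is why doing the grouping inside Elasticsearch is preferable.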

Solution

This is pretty late, but I think this can be done with a pattern_replace char filter that captures the domain name with a regex. This is my setup:

PUT email_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "char_filter": [
            "domain"
          ],
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "domain": {
          "type": "pattern_replace",
          "pattern": ".*@(.*)",
          "replacement": "$1"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "domain": {
          "type": "string",
          "analyzer": "my_custom_analyzer"
        },
        "sender_not_analyzed": {
          "type": "string",
          "index": "not_analyzed",
          "copy_to": "domain"
        }
      }
    }
  }
}

Here the domain char filter captures the domain name. The keyword tokenizer is needed so that the domain is emitted unchanged as a single token. I am using the lowercase filter, but whether to use it is up to you. The copy_to parameter copies the value of sender_not_analyzed into the domain field; the _source field is not modified to include this value, but the domain field can still be queried and aggregated on.
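To see which terms this analyzer chain would likely put into the domain field, the following Python sketch emulates its steps outside Elasticsearch. It is an approximation under stated assumptions: `emulate_analyzer` and `fold_ascii` are hypothetical helpers, Python's `re` stands in for the Java regex engine, and the asciifolding filter is only roughly imitated by stripping combining marks.

```python
import re
import unicodedata

# Same pattern as the "domain" char filter; "\1" plays the role of "$1".
DOMAIN_PATTERN = re.compile(r".*@(.*)")


def fold_ascii(text: str) -> str:
    """Rough stand-in for asciifolding: drop accents via NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def emulate_analyzer(sender: str) -> str:
    """Approximate my_custom_analyzer: char filter, keyword tokenizer, filters."""
    replaced = DOMAIN_PATTERN.sub(r"\1", sender)  # pattern_replace char filter
    token = replaced                              # keyword tokenizer: one token
    return fold_ascii(token.lower())              # lowercase, then folding


for sender in [
    "jon.doe@foo.com",
    "Jane.Doe@FOO.com",
    "<Mike <mike@fizzbuzz.com>@MISSING_DOMAIN>",
]:
    print(sender, "->", emulate_analyzer(sender))
```

Since `.*` is greedy, the pattern keeps only what follows the last @, so the malformed sender from the question ends up as `missing_domain>` rather than `fizzbuzz.com`. Values without an @ pass through unchanged.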

GET email_index/_search
{
  "size": 0,
  "query": {
    "regexp": {
      "sender_not_analyzed": ".*[@].*"
    }
  },
  "aggs": {
    "sender-stats": {
      "terms": {
        "field": "domain"
      }
    }
  }
}

This should give you the desired result.
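A terms aggregation returns only the top 10 buckets by default, so an index with many domains may need a larger `size` inside the `terms` block. The sketch below builds the same request body as a Python dict with that parameter exposed. The function name `domain_terms_request` and the default of 50 buckets are assumptions for illustration, and sending the body to a cluster is left out.

```python
import json


def domain_terms_request(field: str = "domain", max_buckets: int = 50) -> dict:
    """Build the search body from the answer, with a configurable bucket count."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "query": {"regexp": {"sender_not_analyzed": ".*[@].*"}},
        "aggs": {
            "sender-stats": {
                "terms": {"field": field, "size": max_buckets}
            }
        },
    }


body = domain_terms_request(max_buckets=100)
print(json.dumps(body, indent=2))
```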
