elasticsearch group-by multiple fields


Problem description

I am looking for the best way to group data in Elasticsearch. Elasticsearch doesn't support anything like SQL's GROUP BY.

Let's say I have 1k categories and millions of products. What do you think is the best way to render a complete category tree? Of course you need some metadata (icon, link target, SEO titles, ...) and custom sorting for the categories.

  1. Using aggregations: example: https://found.no/play/gist/8124563 Looks usable if you have to group by one field and need some extra fields.

  2. Using multiple fields in a facet (won't work): example: https://found.no/play/gist/1aa44e2114975384a7c2 Here we lose the relationship between the different fields.

  3. Building funny facets: https://found.no/play/gist/8124810

For example, building a category tree using these 3 "solutions" sucks. Solution 1 may work (ES 1 isn't stable right now). Solution 2 doesn't work. Solution 3 is a pain because it feels ugly, you need to prepare a lot of data, and the facets blow up.

Maybe an alternative could be not to store any category data in ES, just the id: https://found.no/play/gist/a53e46c91e2bf077f2e1

Then you could get the associated categories from another system, like Redis, Memcache, or the database.
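A minimal sketch of that alternative: the search index returns only category ids, and the metadata lives in a separate key-value store. A plain dict stands in for Redis/Memcache here, and all keys and fields are illustrative assumptions, not part of the question.

```python
# Stand-in for an external store such as Redis or Memcache; the ids and
# metadata fields below are made up for illustration.
category_store = {
    "cat-1": {"title": "Shoes", "icon": "shoe.png", "sort": 10},
    "cat-2": {"title": "Shirts", "icon": "shirt.png", "sort": 20},
}

def resolve_categories(ids, store):
    """Batch-fetch category metadata for the ids a search response returned."""
    found = [store[i] for i in ids if i in store]
    # Apply the custom sorting that would otherwise have to live in ES.
    return sorted(found, key=lambda c: c["sort"])

resolved = resolve_categories(["cat-2", "cat-1", "cat-404"], category_store)
```

With a real Redis this lookup would be a single MGET, so fetching even 1k categories is one round trip rather than 1k.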

This would end up in clean code, but the performance could become a problem. For example, loading 1k categories from Memcache / Redis / a database could be slow. Another problem is that syncing 2 databases is harder than syncing one.

How do you handle problems like this?

I am sorry for the links, but I can't post more than 2 in one article.

Recommended answer

The aggregations API allows grouping by multiple fields using sub-aggregations. Suppose you want to group by the fields field1, field2 and field3:

{
  "aggs": {
    "agg1": {
      "terms": {
        "field": "field1"
      },
      "aggs": {
        "agg2": {
          "terms": {
            "field": "field2"
          },
          "aggs": {
            "agg3": {
              "terms": {
                "field": "field3"
              }
            }
          }          
        }
      }
    }
  }
}

Of course, this can go on for as many fields as you'd like.
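Since each level just wraps the next, the query above can be generated for any list of fields. A small sketch (the `agg1`/`agg2`/... names follow the example query; everything else is an assumption):

```python
def nested_terms_agg(fields):
    """Build an 'aggs' body that groups by each field via nested sub-aggregations."""
    agg = {}
    # Build from the innermost field outwards so each level wraps the next.
    for i, field in enumerate(reversed(fields), start=1):
        name = "agg%d" % (len(fields) - i + 1)
        level = {name: {"terms": {"field": field}}}
        if agg:
            level[name]["aggs"] = agg
        agg = level
    return {"aggs": agg}

query = nested_terms_agg(["field1", "field2", "field3"])
```

`query` here is the same three-level body shown above, ready to pass to a search request.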

Update:
For completeness, here is what the output of the above query looks like. Below is also Python code for generating the aggregation query and flattening the result into a list of dictionaries.

{
  "aggregations": {
    "agg1": {
      "buckets": [
        {
          "doc_count": <count>,
          "key": <value of field1>,
          "agg2": {
            "buckets": [
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    { "doc_count": <count>, "key": <value of field3> },
                    { "doc_count": <count>, "key": <value of field3> },
                    ...
                  ]
                }
              },
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [ ... ]
                }
              },
              ...
            ]
          }
        },
        {
          "doc_count": <count>,
          "key": <value of field1>,
          "agg2": { ... }
        },
        ...
      ]
    }
  }
}

The following Python code performs the group-by given the list of fields. If you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you don't need it if you have version 2.0 of Elasticsearch, thanks to this).

def group_by(es, fields, include_missing):
    # Build the nested aggregation spec: one terms aggregation per field,
    # each nested inside the previous level.
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}

    if include_missing:
        # A parallel "missing" aggregation catches documents without the field.
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }

        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing

        current_level_terms = next_level_terms

    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    # Recursively walk the nested buckets and emit one flat record per
    # combination of field values.
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        # The "missing" aggregation's bucket has no 'key'; bucket.get('key')
        # below yields None for it.
        buckets.append(agg_result[current_field + '_missing'])

    if len(fields) == 1:
        # Innermost level: emit the leaf records.
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]

    result = []
    for bucket in buckets:
        # Flatten the deeper levels first, then stamp this level's value
        # onto each record.
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)

    return result
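To see what the flattening step produces without a live cluster, here is a sketch with a canned two-field aggregation response (the field names and values are made up). The recursive walk mirrors get_docs_from_agg_result above, minus the missing-bucket handling:

```python
# Canned response shaped like the output of group_by's query for
# fields=["category", "condition"]; all values are illustrative.
canned = {
    "category": {"buckets": [
        {"key": "books", "doc_count": 3,
         "condition": {"buckets": [
             {"key": "used", "doc_count": 2},
             {"key": "new", "doc_count": 1},
         ]}},
    ]},
}

def flatten(agg_result, fields):
    """Emit one flat dict per combination of field values, with its count."""
    field = fields[0]
    rows = []
    for bucket in agg_result[field]["buckets"]:
        if len(fields) == 1:
            rows.append({field: bucket["key"], "doc_count": bucket["doc_count"]})
        else:
            # Flatten deeper levels, then stamp this level's value on each row.
            for row in flatten(bucket, fields[1:]):
                row[field] = bucket["key"]
                rows.append(row)
    return rows

rows = flatten(canned, ["category", "condition"])
```

Each resulting row is one (category, condition) combination with its document count, which is exactly the "group by" shape the question asks for.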
