Subaggregation leads to missing data

Views: 50 Published: 2021/5/3 20:28:36 elasticsearch elasticsearch-aggregation

Problem description

Question in short: When executing a query with a subaggregation, why does the inner aggregation miss data in some cases?

Question in detail: I have a search query with a subaggregation (buckets within buckets) as follows:

{
    "size": 0,
    "aggs": {
        "outer_docs": {
            "terms": {"size": 20, "field": "field_1_to_aggregate_on"},
            "aggs": {
                "inner_docs": {
                    "terms": {"size": 10000, "field": "field_2_to_aggregate_on"},
                    "aggs": "things to display here"
                }
            }
        }
    }
}

If I execute this query, then for some outer_docs I do not receive all of the inner_docs associated with them. In the output below, there are three inner docs for outer doc key_1.

{
    "hits": {
        "total": 9853,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "outer_docs": {
            "doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
            "buckets": [
                {
                    "key": "key_1", "doc_count": 3,
                    "inner_docs": {
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0,
                        "buckets": [
                            {"key": "1", "doc_count": 1, "some": "data here"},
                            ...
                            {"key": "3", "doc_count": 1, "some": "data here"},
                        ]
                    }
                },
                ...
            ]
        }
    }
}

Now I add a query that selects a single outer_doc, one that would have been among the first 20 anyway.

"query": {"bool": {"must": [{"term": {"field_1_to_aggregate_on": "key_1"}}]}}

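For reference, the query fragment above sits at the top level of the request body, next to "aggs". The following is a minimal sketch of how the filtered request could be assembled in Python. The helper name build_request is hypothetical, and the innermost "aggs" placeholder from the original request is left out because its content is not shown.

```python
import json

# Aggregation part copied from the question; the innermost "aggs"
# placeholder ("things to display here") is intentionally omitted.
AGGS = {
    "outer_docs": {
        "terms": {"size": 20, "field": "field_1_to_aggregate_on"},
        "aggs": {
            "inner_docs": {
                "terms": {"size": 10000, "field": "field_2_to_aggregate_on"},
            }
        },
    }
}


def build_request(outer_key=None):
    """Return the search body; if outer_key is given, filter on that outer term.

    build_request is a hypothetical helper, not part of any Elasticsearch client.
    """
    body = {"size": 0, "aggs": AGGS}
    if outer_key is not None:
        # "query" is a sibling of "aggs" at the top level of the body.
        body["query"] = {
            "bool": {"must": [{"term": {"field_1_to_aggregate_on": outer_key}}]}
        }
    return body


print(json.dumps(build_request("key_1"), indent=2))
```

Calling build_request() without an argument yields the unfiltered request from the beginning of the question.
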
In this case, I do get all inner_docs. In the output below, there are seven inner docs for outer doc key_1.

{
    "hits": {
        "total": 8,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "outer_docs": {
            "doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
            "buckets": [
                {
                    "key": "key_1", "doc_count": 8,
                    "inner_docs": {
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0,
                        "buckets": [
                            {"key": "1", "doc_count": 1, "some": "data here"},
                            ...
                            {"key": "7", "doc_count": 2, "some": "data here"},
                        ]
                    }
                },
                ...
            ]
        }
    }
}

I have explicitly specified that I want 10,000 inner_docs per outer_doc. What is preventing me from getting all the data?

This is my version info:

{
    'build_date': '2018-09-26T13:34:09.098244Z',
    'build_flavor': 'default',
    'build_hash': '04711c2',
    'build_snapshot': False,
    'build_type': 'deb',
    'lucene_version': '7.4.0',
    'minimum_index_compatibility_version': '5.0.0',
    'minimum_wire_compatibility_version': '5.6.0',
    'number': '6.4.2'
}

EDIT: After digging a bit more, I found out that the issue is not related to subaggregation, but to aggregation itself and the usage of shards. I have opened a bug report with Elastic about it.
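To illustrate how shards can make a terms aggregation undercount, here is a toy simulation in Python. It assumes the documented behavior that each shard returns only its own top shard_size terms and the coordinating node merges those partial lists. Whether this is exactly what the bug report describes is not known from the question. The function names and the data are made up for illustration, and this is not Elasticsearch code.

```python
from collections import Counter


def shard_top_terms(docs, shard_size):
    """Return one shard's `shard_size` most frequent terms with local counts."""
    return Counter(docs).most_common(shard_size)


def merged_terms(shards, size, shard_size):
    """Merge the per-shard top lists and keep the `size` most frequent terms."""
    merged = Counter()
    for docs in shards:
        for term, count in shard_top_terms(docs, shard_size):
            merged[term] += count
    return dict(merged.most_common(size))


# Hypothetical data: key_1 is frequent on shard 0 but rare on shard 1,
# so shard 1 does not report it when shard_size is small.
shards = [
    ["key_1"] * 10 + ["key_2"] * 1,
    ["key_3"] * 4 + ["key_4"] * 3 + ["key_1"] * 2,
]

approx = merged_terms(shards, size=2, shard_size=2)
exact = merged_terms(shards, size=2, shard_size=10)

print(approx)  # {'key_1': 10, 'key_3': 4}  (key_1 is missing 2 documents)
print(exact)   # {'key_1': 12, 'key_3': 4}
```

In this sketch, documents that live on a shard where the outer term is not among that shard's top terms never reach the merge step, so any subaggregation under that bucket would miss them too. The terms aggregation accepts a shard_size parameter, and raising it on the outer aggregation is one thing to try. Filtering on a single outer term, as in the second request above, avoids the truncation because that term is then the only candidate on every shard.
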
