Subaggregation leads to missing data

Question

Question in short: When executing a query with a subaggregation, why does the inner aggregation miss data in some cases?

Question in detail: I have a search query with a subaggregation (buckets in buckets) as follows:

{
    "size": 0,
    "aggs": {
        "outer_docs": {
            "terms": {"size": 20, "field": "field_1_to_aggregate_on"},
            "aggs": {
                "inner_docs": {
                    "terms": {"size": 10000, "field": "field_2_to_aggregate_on"},
                    "aggs": "things to display here"
                }
            }
        }
    }
}

If I execute this query, for some outer_docs I do not receive all the inner_docs associated with them. In the output below, there are three inner docs for outer doc key_1.

{
    "hits": {
        "total": 9853,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "outer_docs": {
            "doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
            "buckets": [
                {
                    "key": "key_1", "doc_count": 3,
                    "inner_docs": {
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0,
                        "buckets": [
                            {"key": "1", "doc_count": 1, "some": "data here"},
                            ...
                            {"key": "3", "doc_count": 1, "some": "data here"},
                        ]
                    }
                },
                ...
            ]
        }
    }
}

Now I add a query that selects a single outer_doc, one that would have been among the first 20 anyway.

"query": {"bool": {"must": [{"term": {"field_1_to_aggregate_on": "key_1"}}]}}

In this case, I do get all inner_docs: the output below shows seven inner docs for outer doc key_1.

{
    "hits": {
        "total": 8,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "outer_docs": {
            "doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
            "buckets": [
                {
                    "key": "key_1", "doc_count": 8,
                    "inner_docs": {
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0,
                        "buckets": [
                            {"key": "1", "doc_count": 1, "some": "data here"},
                            ...
                            {"key": "7", "doc_count": 2, "some": "data here"},
                        ]
                    }
                },
                ...
            ]
        }
    }
}

I have explicitly specified that I want up to 10,000 inner_docs per outer_doc. What is preventing me from getting all the data?

Here is my version information:

{
    'build_date': '2018-09-26T13:34:09.098244Z',
    'build_flavor': 'default',
    'build_hash': '04711c2',
    'build_snapshot': False,
    'build_type': 'deb',
    'lucene_version': '7.4.0',
    'minimum_index_compatibility_version': '5.0.0',
    'minimum_wire_compatibility_version': '5.6.0',
    'number': '6.4.2'
}

EDIT: After digging a bit more, I found out that the issue is not related to subaggregation, but to aggregation itself and the use of shards. I have opened a bug report with Elastic about it.

Recommended answer

It turned out that the problem was not due to subaggregation, and that it is an actual feature of Elasticsearch. We are using 5 shards, and when an index is spread over several shards, aggregations such as terms return only approximate results.
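
The effect can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not how Elasticsearch is implemented: each shard reports its local top terms, and a coordinating step sums whatever it receives. The function names (`shard_top_terms`, `merged_terms`) and the sample data are hypothetical and chosen for illustration only.

```python
from collections import Counter


def shard_top_terms(docs, shard_size):
    """Return the shard-local top `shard_size` terms with their counts."""
    counts = Counter(docs)
    # Sort by count (descending), then by term, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:shard_size])


def merged_terms(shards, size, shard_size):
    """Sum the per-shard top lists and keep the overall top `size` terms."""
    total = Counter()
    for docs in shards:
        # A term that did not make a shard's local top list contributes
        # nothing from that shard, which is where documents go missing.
        total.update(shard_top_terms(docs, shard_size))
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:size])


# Hypothetical data: "key_1" has 8 documents spread over 3 shards.
shards = [
    ["key_1"] * 3 + ["a"] * 2,
    ["key_1"] * 1 + ["b"] * 4,
    ["key_1"] * 4 + ["c"] * 5,
]

approx = merged_terms(shards, size=3, shard_size=1)
exact = merged_terms(shards, size=3, shard_size=2)

print(approx["key_1"])  # 3: only the first shard reported key_1
print(exact["key_1"])   # 8: every shard reported key_1
```

With the smaller per-shard limit, key_1 is counted only on the shard where it happened to be the top term, which mirrors the drop from 8 to 3 documents seen in the question. Restricting the search to key_1 with a term query removes the competing terms, so every shard reports it in full.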

We made the problem reproducible and posted it in the Elastic discuss forum. There, we learned that aggregations do not always return all data, and we were pointed to the documentation where this is explained in more detail.

We also learned that using only 1 shard solves the issue, and that when this is not possible, the shard_size parameter can alleviate the problem.
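
As a rough sketch, the query from the question could be built with shard_size set on the outer terms aggregation like this. The helper name `build_query` and the value 1000 are assumptions made for illustration; a suitable value depends on the data, and the placeholder sub-aggregation from the original query is left out.

```python
import json


def build_query(outer_size=20, inner_size=10000, outer_shard_size=1000):
    """Build the search body with an explicit shard_size on the outer terms.

    outer_shard_size is an assumed value: it asks every shard for more
    candidate terms than the final `size`, at the cost of more work per shard.
    """
    return {
        "size": 0,
        "aggs": {
            "outer_docs": {
                "terms": {
                    "field": "field_1_to_aggregate_on",
                    "size": outer_size,
                    "shard_size": outer_shard_size,
                },
                "aggs": {
                    "inner_docs": {
                        "terms": {
                            "field": "field_2_to_aggregate_on",
                            "size": inner_size,
                        }
                    }
                },
            }
        },
    }


query_body = build_query()
print(json.dumps(query_body, indent=4))
```

A larger shard_size makes it more likely that every shard reports a given outer term, but the results remain approximate unless each shard returns all of its terms.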
