在 DataFlow 管道中,按键分组后的简单计数步骤非常慢 [英] A simple counting step following a group by key is extremely slow in a DataFlow pipeline

查看:16
本文介绍了在 DataFlow 管道中,按键分组后的简单计数步骤非常慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个 DataFlow 管道,试图构建索引(键值对)并计算一些指标(例如每个键的多个值).输入数据总共约 60 GB,存储在 GCS 上,管道分配了约 126 个工作线程.根据 Stackdriver,所有工作人员的 CPU 利用率约为 6%.

I have a DataFlow pipeline trying to build an index (key-value pairs) and compute some metrics (like a number of values per key). The input data is about 60 GB total, stored on GCS and the pipeline has about 126 workers allocated. Per Stackdriver all workers have about 6% CPU utilization.

尽管有 126 名工作人员,但管道似乎没有任何进展,并且基于挂墙时间,瓶颈似乎是一个简单的计数步骤,跟随一组.虽然所有其他步骤平均花费的时间不到 1 小时,但计算步骤已经花费了 50 天的时间.日志中的所有警告似乎都没有有用的信息.

The pipeline seems to make no progress despite having 126 workers and based on the wall time the bottleneck seems to be a simple counting step that follows a group by. While all other steps have on average less than 1 hour spent in them, the counting step took already 50 days of the wall time. There seems to be no helpful information all warnings in the log.

计数步骤是按照 WordCount 示例中的相应步骤实现的:

The counting step was implemented following a corresponding step in the WordCount example:

def count_keywords_per_product(self, key_and_group):
    key, group = key_and_group
    count = 0
    for e in group:
        count += 1

    self.stats.product_counter.inc()
    self.stats.keywords_per_product_dist.update(count)

    return (key, count)

前面的分组关键字"步骤是一个简单的 beam.GroupByKey() 转换.

The preceding step "Group keywords" is a simple beam.GroupByKey() transformation.

请告知可能是什么原因以及如何优化.

Please advise what might be the reason and how this can be optimized.

Current resource metrics:
Current vCPUs    126
Total vCPU time      1,753.649 vCPU hr
Current memory   472.5 GB
Total memory time    6,576.186 GB hr
Current PD   3.08 TB
Total PD time    43,841.241 GB hr
Current SSD PD   0 B
Total SSD PD time    0 GB hr
Total Shuffle data processed     1.03 TB
Billable Shuffle data processed      529.1 GB

包括计数一的流水线步骤如下所示:

The pipeline steps including the counting one can be seen below:

推荐答案

此处对每个键求和的最佳方法是使用组合操作.原因是它可以缓解有热键的问题.

The best way of having a sum per key here is to use a combine operation. The reason is that it can alleviate the problem of having hot keys.

尝试将您的 GroupByKey + ParDo 替换为 beam.combiners.Count.PerKey,或适合您用例的类似组合转换.

Try replacing your GroupByKey + ParDo with a beam.combiners.Count.PerKey, or a similar combine transform that suits your use case.

这篇关于在 DataFlow 管道中,按键分组后的简单计数步骤非常慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆