在DataFlow管道中，按组进行简单的计数步骤非常慢 [英] A simple counting step following a group by key is extremely slow in a DataFlow pipeline

查看：79 发布时间：2020/9/3 5:01:22 google-cloud-dataflow apache-beam

本文介绍了在DataFlow管道中，按组进行简单的计数步骤非常慢的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个DataFlow管道，它试图建立一个索引(键-值对)并计算一些指标(如每个键的值个数).输入数据总计约60 GB，存储在GCS上，管道中分配了约126个工作程序.每个Stackdriver的所有工人的CPU利用率约为6％.

I have a DataFlow pipeline trying to build an index (key-value pairs) and compute some metrics (like a number of values per key). The input data is about 60 GB total, stored on GCS and the pipeline has about 126 workers allocated. Per Stackdriver all workers have about 6% CPU utilization.

尽管有126名工人，但是管道似乎没有任何进展，并且根据挂墙时间，瓶颈似乎是一个简单的计算步骤，紧随其后.尽管所有其他步骤平均花费不到1个小时，但计数步骤已经花费了50天的挂墙时间.日志中似乎没有所有警告的有用信息.

The pipeline seems to make no progress despite having 126 workers and based on the wall time the bottleneck seems to be a simple counting step that follows a group by. While all other steps have on average less than 1 hour spent in them, the counting step took already 50 days of the wall time. There seems to be no helpful information all warnings in the log.

计数步骤是在WordCount示例中的相应步骤之后执行的:

The counting step was implemented following a corresponding step in the WordCount example:

def count_keywords_per_product(self, key_and_group):
    key, group = key_and_group
    count = 0
    for e in group:
        count += 1

    self.stats.product_counter.inc()
    self.stats.keywords_per_product_dist.update(count)

    return (key, count)

上一步组关键字"是一个简单的beam.GroupByKey()转换.

The preceding step "Group keywords" is a simple beam.GroupByKey() transformation.

请告知可能的原因以及如何对其进行优化.

Please advise what might be the reason and how this can be optimized.

Current resource metrics:
Current vCPUs    126
Total vCPU time      1,753.649 vCPU hr
Current memory   472.5 GB
Total memory time    6,576.186 GB hr
Current PD   3.08 TB
Total PD time    43,841.241 GB hr
Current SSD PD   0 B
Total SSD PD time    0 GB hr
Total Shuffle data processed     1.03 TB
Billable Shuffle data processed      529.1 GB

包括计数在内的流水线步骤如下所示:

The pipeline steps including the counting one can be seen below:

在DataFlow管道中，按组进行简单的计数步骤非常慢 [英] A simple counting step following a group by key is extremely slow in a DataFlow pipeline

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

在DataFlow管道中，按组进行简单的计数步骤非常慢 [英] A simple counting step following a group by key is extremely slow in a DataFlow pipeline

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭