Best way to prevent fusion in Google Dataflow?
Question
From: https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion

You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.
This is what I came up with in Python. Is this reasonable, or is there a simpler way?
import apache_beam as beam

def prevent_fuse(collection):
    return (
        collection
        # Attach a dummy value so each element becomes a (key, value) pair
        | beam.Map(lambda x: (x, 1))
        # GroupByKey is an aggregation, so it breaks fusion
        | beam.GroupByKey()
        # Ungroup: emit each key once per grouped value
        | beam.FlatMap(lambda x: (x[0] for v in x[1]))
    )
EDIT, in response to Ben Chambers' question
We want to prevent fusion because we have a collection that generates a much larger collection, and we need parallelization across the larger collection. If the steps fuse, I get only one worker across the larger collection.
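To make the ungrouping step above concrete, here is a plain-Python sketch (not Beam code) of what the three transforms in prevent_fuse do to the data; the helper name simulate_prevent_fuse is illustrative:

```python
from collections import defaultdict

def simulate_prevent_fuse(elements):
    # beam.Map(lambda x: (x, 1)): pair each element with a dummy value
    keyed = [(x, 1) for x in elements]
    # beam.GroupByKey(): collect values per key (this is the aggregation
    # boundary that the Dataflow service never fuses across)
    groups = defaultdict(list)
    for key, value in keyed:
        groups[key].append(value)
    # beam.FlatMap(lambda x: (x[0] for v in x[1])): emit each key once
    # per grouped value, restoring the original multiset of elements
    return [key for key, values in groups.items() for _ in values]

print(sorted(simulate_prevent_fuse([3, 1, 2, 2])))  # [1, 2, 2, 3]
```

Note that this trick uses the elements themselves as keys, so it only works when the elements are hashable (and, in Beam, deterministically encodable).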
Answer
Apache Beam SDK 2.3.0 adds the experimental Reshuffle transform, which is the Python alternative to the Reshuffle.viaRandomKey operation mentioned by @BenChambers. You can use it in place of your custom prevent_fuse code.