Best way to prevent fusion in Google Dataflow?

Question

From: https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion

You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.

This is what I came up with in python - is this reasonable / is there a simpler way?

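import apache_beam as beam

def prevent_fuse(collection):
    # A GroupByKey between two steps keeps Dataflow from fusing them.
    return (
        collection
        # Key each element by itself (the element becomes the GroupByKey key).
        | beam.Map(lambda x: (x, 1))
        | beam.GroupByKey()
        # Ungroup: emit the key once per grouped value so duplicates survive.
        | beam.FlatMap(lambda x: (x[0] for v in x[1]))
        )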

EDIT, in response to Ben Chambers' question

We want to prevent fusion because we have a collection which generates a much larger collection, and we need parallelization across the larger collection. If it fuses, I only get one worker across the larger collection.

Recommended answer

Apache Beam SDK 2.3.0 adds the experimental Reshuffle transform, which is the Python alternative to the Reshuffle.viaRandomKey operation mentioned by @BenChambers. You can use it in place of your custom prevent_fuse code.
