Best way to prevent fusion in Google Dataflow?
Question
From: https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion

You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.
This is what I came up with in Python. Is this reasonable, or is there a simpler way?
import apache_beam as beam

def prevent_fuse(collection):
    return (
        collection
        # Attach a dummy value so each element becomes a (key, value) pair
        | beam.Map(lambda x: (x, 1))
        # GroupByKey is an aggregation, so it breaks fusion
        | beam.GroupByKey()
        # Ungroup: emit each key once per grouped value
        | beam.FlatMap(lambda x: (x[0] for v in x[1]))
    )
EDIT, in response to Ben Chambers' question
We want to prevent fusion because we have a collection that generates a much larger collection, and we need parallelization across the larger collection. If the steps fuse, I get only one worker across the larger collection.
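To make the ungrouping step above concrete, here is a plain-Python sketch (not Beam code) of what the three transforms in prevent_fuse do to the data; the helper name simulate_prevent_fuse is illustrative:

```python
from collections import defaultdict

def simulate_prevent_fuse(elements):
    # beam.Map(lambda x: (x, 1)): pair each element with a dummy value
    keyed = [(x, 1) for x in elements]
    # beam.GroupByKey(): collect values per key (this is the aggregation
    # boundary that the Dataflow service never fuses across)
    groups = defaultdict(list)
    for key, value in keyed:
        groups[key].append(value)
    # beam.FlatMap(lambda x: (x[0] for v in x[1])): emit each key once
    # per grouped value, restoring the original multiset of elements
    return [key for key, values in groups.items() for _ in values]

print(sorted(simulate_prevent_fuse([3, 1, 2, 2])))  # [1, 2, 2, 3]
```

Note that this trick uses the elements themselves as keys, so it only works when the elements are hashable (and, in Beam, deterministically encodable).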
Answer
Apache Beam SDK 2.3.0 adds the experimental Reshuffle transform, which is the Python alternative to the Reshuffle.viaRandomKey operation mentioned by @BenChambers. You can use it in place of your custom prevent_fuse code.