Apache Beam/数据流重组 [英] Apache Beam/Dataflow Reshuffle

查看:39
本文介绍了Apache Beam/数据流重组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

org.apache.beam.sdk.transforms.Reshuffle 的目的是什么?在文档中,目的被定义为:

What is the purpose of org.apache.beam.sdk.transforms.Reshuffle? In the documentation the purpose is defined as:

一个 PTransform,它返回一个等价于它的输入的 PCollection 但是在操作上提供了 GroupByKey 的一些副作用,在特别是防止周围变换的融合,通过 id 进行检查点和重复数据删除.

A PTransform that returns a PCollection equivalent to its input but operationally provides some of the side effects of a GroupByKey, in particular preventing fusion of the surrounding transforms, checkpointing and deduplication by id.

防止周围变换融合有什么好处?我认为融合是一种优化,以防止不必要的步骤.实际用例会有所帮助.

What is the benefit of preventing fusion of the surrounding transforms? I thought fusion is an optimization to prevent unnecessarily steps. Actual use case would be helpful.

推荐答案

在几种情况下,您可能想要重新排列数据.以下不是一个详尽的列表,但应该可以让您了解为什么可能会重新洗牌:

There are a couple cases when you may want to reshuffle your data. The following is not an exhaustive list, but should give you and idea about why you may reshuffle:

这意味着在 ParDo 之后并行度增加.如果你不在这里打破融合,你的管道将无法将数据拆分到多台机器上进行处理.

This means that the parallelism is increased after your ParDo. If you don't break the fusion here, your pipeline will not be able to split data into multiple machines to process it.

考虑为每个输入元素生成一百万个输出元素的 DoFn 的极端情况.考虑到此 ParDo 在其输入中接收 10 个元素.如果你不打破这个高扇出 ParDo 与其下游转换之间的融合,它只能在 10 台机器上运行,尽管你将拥有数百万个元素.

Consider the extreme case of a DoFn that generates a million output elements for every input element. Consider that this ParDo receives 10 elements in its input. If you don't break fusion between this high-fanout ParDo and its downstream transforms, it will only be able to run on 10 machines, although you will have millions of elements.

  • 诊断此问题的一个好方法是查看输入 PCollection 中的元素数量与输出 PCollection 的元素数量.如果后者明显大于第一个,那么您可能需要考虑添加重新洗牌.
  • A good way to diagnose this is looking at the number of elements in an input PCollection vs the number of elements of an output PCollection. If the latter is significantly larger than the first, then you may want to consider adding a reshuffle.

假设您的管道消耗 9 个 10MB 的文件和一个 10GB 的文件.如果每个文件都由一台机器读取,那么您将拥有一台机器的数据比其他机器多得多.

Imagine that your pipeline consumes 9 files of 10MB and one file of 10GB. If each file is read by a single machine, you will have one machine with a lot more data than the others.

如果你不重新整理这些数据,当你的管道运行时,你的大部分机器都将处于空闲状态.重新排列它可以让您重新平衡要在机器之间更均匀地处理的数据.

If you don't reshuffle this data, most of your machines will be idle while your pipeline runs. Reshuffling it allows you to rebalance the data to be processed more evenly across machines.

  • 诊断此问题的一个好方法是查看有多少工作人员正在您的管道中执行工作.如果管道速度很慢,并且只有一个工作人员处理数据,那么您可以从重新洗牌中受益.
  • A good way to diagnose this is by looking at how many workers are executing work in your pipeline. If the pipeline is slow, and there is only one worker processing data, then you can benefit from a reshuffle.

这篇关于Apache Beam/数据流重组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆