Apache Beam/数据流改组 [英] Apache Beam/Dataflow Reshuffle

查看:63
本文介绍了Apache Beam/数据流改组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

org.apache.beam.sdk.transforms.Reshuffle的目的是什么?在文档中,目的定义为:

What is the purpose of org.apache.beam.sdk.transforms.Reshuffle? In the documentation the purpose is defined as:

一个PTransform,它返回一个等效于其输入的PCollection,但在操作上提供了GroupByKey的一些副作用,特别防止周围变换的融合,按ID进行检查点和重复数据删除.

A PTransform that returns a PCollection equivalent to its input but operationally provides some of the side effects of a GroupByKey, in particular preventing fusion of the surrounding transforms, checkpointing and deduplication by id.

防止周围变换融合的好处是什么?我认为融合是一种优化措施,可以防止不必要的步骤.实际用例会有所帮助.

What is the benefit of preventing fusion of the surrounding transforms? I thought fusion is an optimization to prevent unnecessarily steps. Actual use case would be helpful.

推荐答案

在几种情况下,您可能需要重新整理数据.以下不是详尽的清单,但应给您和您为什么可以改组的想法:

There are a couple cases when you may want to reshuffle your data. The following is not an exhaustive list, but should give you and idea about why you may reshuffle:

这意味着在您的ParDo之后,并行度增加了.如果您在此处没有破坏融合,那么您的管道将无法将数据拆分到多台计算机中进行处理.

This means that the parallelism is increased after your ParDo. If you don't break the fusion here, your pipeline will not be able to split data into multiple machines to process it.

考虑DoFn的极端情况,该DoFn会为每个输入元素生成一百万个输出元素.考虑此ParDo在其输入中接收10个元素.如果您不破坏这种高扇形的ParDo及其下游变换之间的融合,尽管您将拥有数百万个元素,但它只能在10台机器上运行.

Consider the extreme case of a DoFn that generates a million output elements for every input element. Consider that this ParDo receives 10 elements in its input. If you don't break fusion between this high-fanout ParDo and its downstream transforms, it will only be able to run on 10 machines, although you will have millions of elements.

  • 诊断此问题的一种好方法是查看输入PCollection中的元素数量与输出PCollection中的元素数量.如果后者明显大于第一个,则您可能要考虑添加重新排列.
  • A good way to diagnose this is looking at the number of elements in an input PCollection vs the number of elements of an output PCollection. If the latter is significantly larger than the first, then you may want to consider adding a reshuffle.

想象一下,您的管道消耗9个10MB的文件和1个10GB的文件.如果每个文件都是由一台计算机读取的,则您的一台计算机将比其他计算机拥有更多的数据.

Imagine that your pipeline consumes 9 files of 10MB and one file of 10GB. If each file is read by a single machine, you will have one machine with a lot more data than the others.

如果不重新组合此数据,则在管道运行时,大多数计算机将处于空闲状态.改组它使您可以重新平衡要在计算机之间更均匀地处理的数据.

If you don't reshuffle this data, most of your machines will be idle while your pipeline runs. Reshuffling it allows you to rebalance the data to be processed more evenly across machines.

  • 诊断此问题的一种好方法是查看您的管道中有多少工人正在执行工作.如果管道很慢,并且只有一个工作人员在处理数据,那么您可以从改组中受益.
  • A good way to diagnose this is by looking at how many workers are executing work in your pipeline. If the pipeline is slow, and there is only one worker processing data, then you can benefit from a reshuffle.

这篇关于Apache Beam/数据流改组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆