Dataflow Apache Beam Python job stuck at Group by step


Problem description

I am running a Dataflow job that reads from BigQuery, scans around 8 GB of data, and produces more than 50,000,000 records. At the group-by step I want to group on a key and concatenate one column. After concatenation, the size of the concatenated column becomes more than 100 MB, which is why I have to do that group-by in the Dataflow job: it cannot be done at the BigQuery level due to BigQuery's 100 MB row size limit.
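The "group on a key and concatenate one column" step described above can be sketched in plain Python on a toy sample (the field names `key` and `text` are hypothetical, not from the original job):

```python
from itertools import groupby
from operator import itemgetter

# Toy rows standing in for the BigQuery output; field names are illustrative.
rows = [
    {"key": "a", "text": "x"},
    {"key": "b", "text": "y"},
    {"key": "a", "text": "z"},
]

# groupby requires the input to be sorted on the grouping key.
rows.sort(key=itemgetter("key"))

# For each key, concatenate the column values into one string.
grouped = {
    k: ",".join(r["text"] for r in g)
    for k, g in groupby(rows, key=itemgetter("key"))
}
print(grouped)  # {'a': 'x,z', 'b': 'y'}
```

At the scale in the question the concatenated value per key can exceed 100 MB, which is why this has to happen in Dataflow rather than in a BigQuery aggregation.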

Now the Dataflow job scales well when reading from BigQuery but gets stuck at the group-by step. I have 2 versions of the Dataflow code, and both get stuck at that step. When I checked the Stackdriver logs, they report processing stuck at lull for more than 1010 seconds (and similar messages), along with messages like Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7f618b406358>.

I expect the group-by step to complete within 20 minutes, but it stays stuck for more than 1 hour and never finishes.

Answer

I figured out the thing myself. Below are the 2 changes that I made in my pipeline:

  1. I added a Combine function just after the Group by Key (see screenshot).

  2. Because the group-by exchanges a lot of network traffic when running on multiple workers, and by default the network we were using did not allow inter-node communication, I had to create a firewall rule on that network allowing traffic from one worker to another, i.e. allowing the workers' IP range.
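The first change relies on the combiner pattern: partial results are built on each worker and merged, so far less data crosses the shuffle than with a raw GroupByKey followed by concatenation. The sketch below mirrors the method names of Beam's `CombineFn` interface but runs as plain Python without Beam installed; the comma separator and string values are illustrative assumptions:

```python
# Sketch of a Beam-style combiner for per-key string concatenation.
# Each worker folds its local values into an accumulator (add_input),
# then the partial accumulators are merged across workers
# (merge_accumulators) before the final value is produced.
class ConcatCombineFn:
    def create_accumulator(self):
        return []                      # per-worker list of value fragments

    def add_input(self, acc, value):
        acc.append(value)              # fold one element in locally
        return acc

    def merge_accumulators(self, accs):
        merged = []
        for a in accs:                 # combine partial results from workers
            merged.extend(a)
        return merged

    def extract_output(self, acc):
        return ",".join(acc)          # the final concatenated column

# Simulate two workers each seeing one value for the same key.
fn = ConcatCombineFn()
a1 = fn.add_input(fn.create_accumulator(), "x")
a2 = fn.add_input(fn.create_accumulator(), "z")
result = fn.extract_output(fn.merge_accumulators([a1, a2]))
print(result)  # x,z
```

In an actual Beam pipeline, a combiner with this shape would typically be applied with `beam.CombinePerKey(...)` rather than a separate GroupByKey plus concatenation step.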

