Dataflow Apache Beam Python job stuck at Group by step


Problem description

I am running a Dataflow job that reads from BigQuery, scans around 8 GB of data, and produces more than 50,000,000 records. In the group-by step I want to group on a key and concatenate one column. The concatenated column ends up larger than 100 MB, which is why I have to do the group by in the Dataflow job: it cannot be done in BigQuery because of the 100 MB row size limit.

The Dataflow job scales well while reading from BigQuery but gets stuck at the group-by step. I have two versions of the Dataflow code, and both get stuck there. When I checked the Stackdriver logs, I found messages similar to "processing stuck at lull for more than 1010 sec" and "Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7f618b406358>".

I expect the group-by step to complete within 20 minutes, but it stays stuck for more than an hour and never finishes.

Recommended answer

I figured this out myself. Below are the two changes I made to my pipeline:

  1. I added a Combine function right after the Group by Key (see screenshot).

  2. Group by Key exchanges a lot of network traffic between workers when it runs on multiple workers, and the network we use does not allow communication between workers by default. I therefore had to create a firewall rule that allows network traffic from one worker to another, i.e. within the workers' IP range.
