Google Cloud Dataflow fails in combine function due to worker losing contact


Problem description

My Dataflow consistently fails in my combine function with no errors reported in the logs beyond a single entry of:

 A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service.

I am using the Apache Beam Python SDK 2.4.0. I have tried performing this step with both CombinePerKey and CombineGlobally. The pipeline failed in the combine function in both cases. The pipeline completes when running with a smaller amount of data.

Am I exhausting worker resources and not being told about it? What can cause a worker to lose contact with the service?

Update:

Using n1-highmem-4 workers gives me the same failure. When I check Stackdriver I see no errors, but three kinds of warnings: No session file found, Refusing to split, and Processing lull. My input collection size says it's 17,000 elements spread across ~60 MB, but Stackdriver has a statement saying I'm using ~25 GB on a single worker which is getting towards the max. For this input, each accumulator created in my CombineFn should take roughly 150 MB memory. Is my pipeline creating too many accumulators and exhausting its memory? If so, how can I tell it to merge accumulators more often or limit the number created?
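The original MatrixCombiner isn't shown in the question, so here is a minimal pure-Python sketch of the CombineFn accumulator lifecycle that CombinePerKey drives (class and method names mirror Beam's CombineFn interface, but the matrix contents and sizes are illustrative stand-ins for the ~150 MB accumulators described above). The memory question hinges on how many accumulators are live at once: one per bundle per key, merged later.

```python
# Hypothetical sketch of a CombineFn-style matrix accumulator
# (no Beam dependency). Illustrates why eager merging matters:
# each live accumulator is a full matrix, so memory scales with
# the number of unmerged accumulators, not with input elements.

class MatrixCombinerSketch:
    def __init__(self, size):
        self.size = size  # matrix dimension (tiny here; large in the real job)

    def create_accumulator(self):
        # One accumulator per bundle per key: a size x size zero matrix.
        return [[0.0] * self.size for _ in range(self.size)]

    def add_input(self, acc, element):
        # Fold one element into the accumulator in place, keeping memory
        # at one matrix per live accumulator rather than one per element.
        i, j, value = element
        acc[i][j] += value
        return acc

    def merge_accumulators(self, accumulators):
        # Merge into the first accumulator and drop the rest, releasing
        # their memory as soon as possible.
        accumulators = iter(accumulators)
        merged = next(accumulators)
        for acc in accumulators:
            for i in range(self.size):
                for j in range(self.size):
                    merged[i][j] += acc[i][j]
        return merged

    def extract_output(self, acc):
        return acc

fn = MatrixCombinerSketch(size=2)
a = fn.add_input(fn.create_accumulator(), (0, 1, 2.0))
b = fn.add_input(fn.create_accumulator(), (0, 1, 3.0))
out = fn.extract_output(fn.merge_accumulators([a, b]))
```

With 4 CPUs per worker and many concurrent bundles, dozens of 150 MB accumulators can coexist before merging, which is consistent with the ~25 GB observation.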

I do have an error log entry verifying my worker was killed due to OOM. It just isn't tagged as a worker error, which is the default filter in the Dataflow monitor.
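To surface such entries without relying on the monitor's default filter, the Cloud Logging CLI can query all high-severity entries for a job. This is a hedged sketch: the `dataflow_step` resource type comes from Dataflow's Stackdriver integration, and `JOB_ID` is a placeholder.

```shell
# List all ERROR-and-above entries for a Dataflow job, regardless of
# which worker log they were written to. JOB_ID is a placeholder.
gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="JOB_ID" AND severity>=ERROR' \
  --limit=20
```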

The pipeline definition looks similar to:

table1 = (p | "Read Table1" >> beam.io.Read(beam.io.BigQuerySource(query=query))
            | "Key Table1 rows" >> beam.Map(lambda row: (row['key'], row)))
table2 = (p | "Read Table2" >> beam.io.Read(beam.io.BigQuerySource(query=query))
            | "Key Table2 rows" >> beam.Map(lambda row: (row['key'], row)))

merged = ({"table1": table1, "table2": table2}
          | "Join" >> beam.CoGroupByKey()
          | "Reshape" >> beam.ParDo(ReshapeData())
          | "Key merged rows" >> beam.Map(lambda row: (row['key'], row))
          | "Build matrix" >> beam.CombinePerKey(MatrixCombiner())  # Dies here
          | "Write matrix" >> beam.io.avroio.WriteToAvro())

Answer

Running with fewer workers leads to fewer accumulators and successful completion of the pipeline.
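Worker count can be pinned explicitly through standard Dataflow pipeline options so autoscaling cannot fan out into many concurrent bundles and accumulators; the script name, project, and bucket below are placeholders.

```shell
# Cap the job at a fixed, small worker count and disable autoscaling.
# pipeline.py, my-project, and my-bucket are placeholders.
python pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --temp_location=gs://my-bucket/tmp \
  --num_workers=4 \
  --max_num_workers=4 \
  --autoscaling_algorithm=NONE \
  --worker_machine_type=n1-highmem-4
```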

