Dataflow: No Worker Activity

Problem description

I'm having a few problems running a relatively vanilla Dataflow job from an AI Platform Notebook (the job is meant to take data from BigQuery > cleanse and prep > write to a CSV in GCS):

import apache_beam as beam

# Placeholder values throughout: PROJECT, REGION, selquery and to_csv are
# defined elsewhere in the notebook.
options = {'staging_location': '/staging/location/',
           'temp_location': '/temp/location/',
           'job_name': 'dataflow_pipeline_job',
           'project': PROJECT,
           'teardown_policy': 'TEARDOWN_ALWAYS',
           'max_num_workers': 3,
           'region': REGION,
           'subnetwork': 'regions/<REGION>/subnetworks/<SUBNETWORK>',
           'no_save_main_session': True}
opts = beam.pipeline.PipelineOptions(flags=[], **options)
p = beam.Pipeline('DataflowRunner', options=opts)
(p
 | 'read' >> beam.io.Read(beam.io.BigQuerySource(query=selquery, use_standard_sql=True))
 | 'csv' >> beam.FlatMap(to_csv)
 | 'out' >> beam.io.Write(beam.io.WriteToText('OUTPUT_DIR/out.csv')))
p.run()

Error returned from Stackdriver:

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.

With the following warning:

S01:eval_out/WriteToText/Write/WriteImpl/DoOnce/Read+out/WriteToText/Write/WriteImpl/InitializeWrite failed.

Unfortunately not much else other than that. Other things to note:

  • The job runs locally without any errors
  • The network is running in custom mode, but it is the default network
  • Python version == 3.5.6
  • Apache Beam Python SDK version == 2.16.0
  • The AI Platform Notebook is really a GCE instance with a Deep Learning VM image deployed on top (Container-Optimized OS); we then use port forwarding to access the Jupyter environment
  • The service account requesting the job (the Compute Engine default service account) has the permissions needed to complete the task
  • The notebook instance, the Dataflow job, and the GCS bucket are all in europe-west1
  • I have also tried running this from a standard AI Platform Notebook and hit the same problem.

Any help would be much appreciated! Please let me know if there is any other info I can provide which will help.

I've realised that my error is the same as the following:

Why do Dataflow steps not start?

The reason my job has gotten stuck is that the write-to-GCS step runs first, even though it is meant to run last. Any ideas on how to fix this?

Recommended answer

Upon inspecting the code, I noticed that the syntax of the 'WriteToText' transform used does not match the one suggested in the Apache Beam docs.

Please follow the syntax suggested here.
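
For comparison, here is a minimal sketch of the documented usage as I read the Beam 2.16 docs: WriteToText is itself a PTransform, so it is applied directly rather than wrapped in beam.io.Write, and on Dataflow the output prefix should be a GCS path. The bucket and prefix below are placeholders I've assumed, not the asker's actual values:

# Hypothetical rewrite of the 'out' step only; the rest of the
# asker's pipeline (p, selquery, to_csv) is assumed unchanged.
(p
 | 'read' >> beam.io.Read(beam.io.BigQuerySource(query=selquery, use_standard_sql=True))
 | 'csv' >> beam.FlatMap(to_csv)
 | 'out' >> beam.io.WriteToText(
       'gs://YOUR_BUCKET/results/out',  # placeholder GCS file path prefix
       file_name_suffix='.csv',         # appended to each output shard
       num_shards=1))                   # optional: write a single output file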

The suggested workaround is to consider using the BigQuery-to-CSV file export option, which is available in batch mode.
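
If that route fits, a minimal sketch using the google-cloud-bigquery client is below; the project, table, and bucket values are placeholders I've assumed rather than details from the question:

from google.cloud import bigquery

client = bigquery.Client(project='YOUR_PROJECT')  # placeholder project ID

job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.CSV

extract_job = client.extract_table(
    'YOUR_PROJECT.YOUR_DATASET.YOUR_TABLE',  # source table to export
    'gs://YOUR_BUCKET/exports/out-*.csv',    # '*' lets BigQuery shard large exports
    job_config=job_config,
    location='europe-west1')                 # should match the dataset's location
extract_job.result()                         # block until the export completes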

There are even more export options available; the full list can be found in the "data formats and compression types" documentation here.
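
As a rough illustration, other formats and compression types from that documentation are set on the same job config; for example, switching the sketch above to gzipped newline-delimited JSON:

# Assumed variant of the job config above, not the asker's setup.
job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
job_config.compression = bigquery.Compression.GZIP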
