Apache Beam Dataflow runner throwing setup error


Question

We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we are getting the error below,

A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.

But we could not find the detailed worker-startup logs.
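
For reference, a hedged sketch of how those worker-startup logs can usually be pulled from Cloud Logging (the project ID matches the command below; the exact filter and log name may differ for your setup):

gcloud logging read \
    'resource.type="dataflow_step" AND logName:"worker-startup"' \
    --project=xyz \
    --limit=50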

We tried increasing the memory size, the worker count, etc., but we still get the same error.

Here is the command we used,

python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--requirements_file=requirements.txt \
--worker_machine_type n1-standard-8 \
--num_workers 2

Pipeline snippet,

import apache_beam as beam

# Read rows from BigQuery; each element is a dict keyed by column name.
data = pipeline | "load data" >> beam.io.Read(
    beam.io.BigQuerySource(query="SELECT * FROM abc_table LIMIT 100")
)

# Keep only the rows whose column matches the expected value.
data | "filter data" >> beam.Filter(lambda x: x.get('column_name') == value)

The above pipeline just loads data from BigQuery and filters it on some column value. It works like a charm with the DirectRunner but fails on Dataflow.

Are we making any obvious setup mistake? Is anyone else getting the same error? We could use some help resolving this issue.

Our pipeline code is spread across multiple files, so we created a Python package. We solved the setup error by passing the --setup_file argument instead of --requirements_file, as shown in the adjusted command below.
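
For illustration, the adjusted invocation might look like this sketch (the ./setup.py path is an assumption; point it at your package's setup script):

python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--setup_file=./setup.py \
--worker_machine_type n1-standard-8 \
--num_workers 2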

Answer

We resolved this setup error by sending a different set of arguments to Dataflow. Our code is spread across multiple files, so we had to create a package for it. If we use --requirements_file, the job starts but eventually fails because it cannot find the package on the workers. The Beam Python SDK sometimes does not throw an explicit error message for this; instead it retries the job and then fails. To get your code running as a package, you need to pass the --setup_file argument, whose setup.py lists the dependencies. Make sure the package created by the python setup.py sdist command includes all the files required by your pipeline code.
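
As a rough sketch, a minimal setup.py could look like the following (the package name, version, and install_requires entries are placeholders; list the dependencies your workers actually need):

# setup.py -- minimal packaging sketch; names and versions are placeholders
import setuptools

setuptools.setup(
    name='my-beam-pipeline',               # hypothetical package name
    version='0.0.1',
    install_requires=[
        # runtime dependencies the Dataflow workers need, e.g.
        # 'requests==2.18.4',
    ],
    packages=setuptools.find_packages(),   # include all pipeline modules
)

Running python setup.py sdist should then produce a tarball under dist/ that contains every module your pipeline imports.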

If you have a privately hosted Python package dependency, then pass --extra_package with the path to the package.tar.gz file. A better way is to store it in a GCS bucket and pass that path here.
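
For example, a hedged sketch assuming a hypothetical private package uploaded to the staging bucket (the package name and GCS path are made up):

python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--setup_file=./setup.py \
--extra_package=gs://xyz/packages/private_lib-1.0.0.tar.gz \
--worker_machine_type n1-standard-8 \
--num_workers 2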

I have written an example project to help get started with the Apache Beam Python SDK on Dataflow - https://github.com/RajeshHegde/apache-beam-example

Read about it here - https://medium.com/@rajeshhegde/data-pipeline-using-apache-beam-python-sdk-on-dataflow-6bb8550bf366

