Custom Apache Beam Python version in Dataflow


Question

I am wondering if it is possible to have a custom Apache Beam Python version running in Google Dataflow, i.e. a version that is not available in the public repositories (as of this writing: 0.6.0 and 2.0.0). For example, the HEAD version from the official Apache Beam repository, or a specific tag for that matter.

I am aware of the possibility of packaging custom packages (private local ones, for example) as described in the official documentation. There are answered questions here on how to do this for some other scripts, and there is even a GIST guiding on this.
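For reference, the mechanism that documentation describes is, as far as I can tell, the --extra_package pipeline option; a minimal sketch, where the tarball path is a placeholder for your own locally built package:

python your_pipeline.py [...your_options...] --extra_package /path/to/my_private_package-0.1.tar.gz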

But I have not managed to get the current Apache Beam development version (or a tagged one) that is available in the master branch of its official repository packaged and sent along with my script to Google Dataflow. For example, for the latest available tag, the link for pip to process would be git+https://github.com/apache/beam.git@v2.1.0-RC2#egg=apache_beam[gcp]&subdirectory=sdks/python, and I get something like this:

INFO:root:Executing command: ['.../bin/python', '-m', 'pip', 'install', '--download', '/var/folders/nw/m_035l9d7f1dvdbd7rr271tcqkj80c/T/tmpJhCkp8', 'apache-beam==2.1.0', '--no-binary', ':all:', '--no-deps']
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting apache-beam==2.1.0
  Could not find a version that satisfies the requirement apache-beam==2.1.0 (from versions: 0.6.0, 2.0.0)
No matching distribution found for apache-beam==2.1.0

Any ideas? (I am wondering whether it is even possible, since Google Dataflow may have fixed the Apache Beam versions it can run to the officially released ones.)

Answer

I will answer this myself, as I got the answer to this question on an Apache Beam JIRA issue I have been helping with.

If you want to use a custom Apache Beam Python version in Google Cloud Dataflow (that is, run your pipeline with --runner DataflowRunner), you must use the option --sdk_location <apache_beam_v1.2.3.tar.gz> when you run your pipeline, where <apache_beam_v1.2.3.tar.gz> is the location of the corresponding packaged version that you want to use.

For example, as of this writing, if you have checked out the HEAD version of the Apache Beam git repository, you first have to package the SDK by navigating to the Python SDK directory with cd beam/sdks/python and then running python setup.py sdist (a compressed tar file will be created in the dist subdirectory). A sketch of these steps as shell commands is shown below.
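The packaging steps sketched as shell commands (the clone URL is the official Apache Beam repository; the rest follows the description above):

git clone https://github.com/apache/beam.git   # check out HEAD of the official repository
cd beam/sdks/python                            # navigate to the Python SDK
python setup.py sdist                          # the sdist tarball is written to the dist/ subdirectory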

Thereafter you can run your pipeline like this:

python your_pipeline.py [...your_options...] --sdk_location beam/sdks/python/dist/apache-beam-2.2.0.dev0.tar.gz

Google Cloud Dataflow will use the supplied SDK.
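For completeness, a sketch of a fuller invocation; my-gcp-project and gs://my-bucket/temp are placeholders for your own project ID and Cloud Storage bucket, and the exact set of other options depends on your pipeline:

# my-gcp-project and gs://my-bucket/temp are placeholders, not real resources
python your_pipeline.py \
    --runner DataflowRunner \
    --project my-gcp-project \
    --temp_location gs://my-bucket/temp \
    --sdk_location beam/sdks/python/dist/apache-beam-2.2.0.dev0.tar.gz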
