Custom Apache Beam Python version in Dataflow
Question
I am wondering if it is possible to run a custom Apache Beam Python version in Google Dataflow, that is, a version that is not available in the public repositories (as of this writing: 0.6.0 and 2.0.0). For example, the HEAD version from the official Apache Beam repository, or a specific tag for that matter.
I am aware of the possibility of packaging custom packages (private local ones, for example) as described in the official documentation. There are answered questions here on how to do this for some other scripts, and there is even a GIST guiding on this.
But I have not managed to get the current Apache Beam development version (or a tagged one), available in the master branch of its official repository, packaged and sent along with my script to Google Dataflow.
For example, for the latest available tag, the link for pip to process would be: git+https://github.com/apache/beam.git@v2.1.0-RC2#egg=apache_beam[gcp]&subdirectory=sdks/python
I get something like this:
INFO:root:Executing command: ['.../bin/python', '-m', 'pip', 'install', '--download', '/var/folders/nw/m_035l9d7f1dvdbd7rr271tcqkj80c/T/tmpJhCkp8', 'apache-beam==2.1.0', '--no-binary', ':all:', '--no-deps']
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting apache-beam==2.1.0
Could not find a version that satisfies the requirement apache-beam==2.1.0 (from versions: 0.6.0, 2.0.0)
No matching distribution found for apache-beam==2.1.0
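For reference, the pip VCS link used above always has the same shape; a throwaway sketch that assembles it for an arbitrary tag (nothing here is Dataflow-specific, and the helper name is made up for illustration):

```python
def beam_git_requirement(tag):
    # Builds the pip-style VCS requirement for a given Beam tag or branch.
    # The [gcp] extra pulls in the Google Cloud / Dataflow dependencies,
    # and subdirectory= points pip at the Python SDK inside the monorepo.
    return ("git+https://github.com/apache/beam.git@{tag}"
            "#egg=apache_beam[gcp]&subdirectory=sdks/python").format(tag=tag)
```

For example, beam_git_requirement("v2.1.0-RC2") reproduces the link quoted above.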
Any ideas? (I am wondering if it is even possible, since Google Dataflow may have restricted the Apache Beam versions that can run to the officially released ones.)
Answer
I will answer myself, as I got the answer to this question at an Apache Beam JIRA issue I have been helping with.
If you want to use a custom Apache Beam Python version in Google Cloud Dataflow (that is, run your pipeline with --runner DataflowRunner), you must use the option --sdk_location <apache_beam_v1.2.3.tar.gz> when you run your pipeline, where <apache_beam_v1.2.3.tar.gz> is the location of the corresponding packaged version that you want to use.
For example, as of this writing, if you have checked out the HEAD version of the Apache Beam git repository, you first have to package the repository by navigating to the Python SDK with cd beam/sdks/python and then running python setup.py sdist (a compressed tar file will be created in the dist subdirectory).
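The packaging step above produces a tarball whose exact version suffix depends on the checked-out revision (for example apache-beam-2.2.0.dev0.tar.gz), so it can be convenient to locate it programmatically before launching; a minimal sketch, where dataflow_command is a hypothetical helper and the dist directory layout is assumed to match the sdist output described above:

```python
import glob
import os

def dataflow_command(pipeline_script, extra_args, sdk_dist_dir="beam/sdks/python/dist"):
    """Build the launch command using the newest sdist found in sdk_dist_dir."""
    # The tarball name carries the dev version suffix, so glob for it
    # instead of hard-coding it.
    tarballs = sorted(glob.glob(os.path.join(sdk_dist_dir, "apache-beam-*.tar.gz")))
    if not tarballs:
        raise FileNotFoundError("no sdist found; run `python setup.py sdist` first")
    return (["python", pipeline_script] + list(extra_args)
            + ["--sdk_location", tarballs[-1]])
```

Passing the resulting list to subprocess.call would then launch the pipeline with the custom SDK attached.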
Thereafter you can run your pipeline like this:
python your_pipeline.py [...your_options...] --sdk_location beam/sdks/python/dist/apache-beam-2.2.0.dev0.tar.gz
Google Cloud Dataflow will use the supplied SDK.
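Before submitting, it can be worth a quick sanity check that the file passed to --sdk_location really is a source distribution; a small sketch using only the standard library (the helper name is made up for illustration):

```python
import tarfile

def looks_like_beam_sdist(path):
    # An sdist is a gzipped tar whose single top-level directory
    # contains setup.py; check for exactly that layout.
    with tarfile.open(path, "r:gz") as tar:
        return any(m.name.endswith("/setup.py") and m.name.count("/") == 1
                   for m in tar.getmembers())
```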