Does Dataflow templating support template input for BigQuery sink options?

Question

As I have a working static Dataflow running, I'd like to create a template from it so I can easily reuse the Dataflow without any command-line typing.

Following the official Creating Templates tutorial doesn't provide a sample for a templatable output.

My Dataflow ends with a BigQuery sink which takes a few arguments, such as the target table for storage. This exact parameter is the one I'd like to make available in my template, allowing me to choose the target storage after running the flow.

But I'm not able to get this working. Below I paste some code snippets which could help explain the exact issue I have.

import apache_beam as beam
from apache_beam.io import BigQuerySink
from apache_beam.io.gcp.bigquery import BigQueryDisposition
from apache_beam.options.pipeline_options import PipelineOptions

class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # ValueProvider arguments are meant to be resolved at runtime.
        parser.add_value_provider_argument(
            '--input',
            default='gs://my-source-bucket/file.json')
        parser.add_value_provider_argument(
            '--table',
            default='my-project-id:some-dataset.some-table')

pipeline_options = PipelineOptions()

pipe = beam.Pipeline(options=pipeline_options)

custom_options = pipeline_options.view_as(CustomOptions)

(...)

# store
processed_pipe | beam.io.Write(BigQuerySink(
    # This .get() call is evaluated at pipeline construction time,
    # which is where the error below is raised.
    table=custom_options.table.get(),
    schema='a_column:STRING,b_column:STRING,etc_column:STRING',
    create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=BigQueryDisposition.WRITE_APPEND
))

When creating the template, I did not give any parameters with it. Within a split second, I get the following error message:

apache_beam.error.RuntimeValueProviderError: RuntimeValueProvider(option: table, type: str, default_value: 'my-project-id:some-dataset.some-table').get() not called from a runtime context

When I add a --table parameter at template creation, the template is created, but the --table parameter value is then hardcoded in the template and not overridden by any template value given for table later.

When I replace table=custom_options.table.get() with table=StaticValueProvider(str, custom_options.table.get()), I get the same error.

Is there someone who already built a templatable Dataflow with customisable BigQuerySink parameters? I'd love to get some hints on this.

Answer

Python currently only supports ValueProvider options for FileBasedSource IOs. You can see that by clicking on the Python tab at the link you mentioned, https://cloud.google.com/dataflow/docs/templates/creating-templates, under the "Pipeline I/O and runtime parameters" section.
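
For reference, here is a minimal sketch of the pattern that does work in Python: hand the ValueProvider object itself to a file-based source such as ReadFromText, without ever calling .get() during pipeline construction. It reuses the --input option from the question; the TemplateOptions name is just illustrative.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--input',
            default='gs://my-source-bucket/file.json')

pipeline_options = PipelineOptions()
custom_options = pipeline_options.view_as(TemplateOptions)

pipe = beam.Pipeline(options=pipeline_options)

# Pass the ValueProvider itself; ReadFromText resolves it at runtime,
# so the value can be supplied when the template is executed.
lines = pipe | beam.io.ReadFromText(custom_options.input)

pipe.run()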

Unlike what happens in Java, BigQuery in Python does not use a custom source. In other words, it is not fully implemented in the SDK but also contains parts in the backend (it is therefore a "native source"). Only custom sources can use templates. There are plans to have BigQuery added as a custom source: https://issues.apache.org/jira/browse/BEAM-1440
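
Until that happens, one possible workaround (my own sketch, not part of the original answer, and assuming the google-cloud-bigquery client library is available on the workers) is to defer the table lookup to runtime yourself: read the ValueProvider inside a DoFn and write rows with the BigQuery client directly. This bypasses BigQuerySink entirely, so you lose its batching and disposition handling; treat it as a proof of concept only.

import apache_beam as beam

class WriteToBigQueryAtRuntime(beam.DoFn):
    """Writes dict rows to BigQuery, resolving the table name at runtime."""

    def __init__(self, table_vp):
        self._table_vp = table_vp  # a ValueProvider; not resolved yet

    def start_bundle(self):
        from google.cloud import bigquery  # imported here so it runs on the worker
        self._client = bigquery.Client()
        # .get() is allowed here because we are now in a runtime context.
        # Convert 'project:dataset.table' to the client's 'project.dataset.table'.
        self._table = self._table_vp.get().replace(':', '.')

    def process(self, row):
        errors = self._client.insert_rows_json(self._table, [row])
        if errors:
            raise RuntimeError('BigQuery insert failed: %s' % errors)

# Usage, replacing the Write(BigQuerySink(...)) step from the question:
# processed_pipe | beam.ParDo(WriteToBigQueryAtRuntime(custom_options.table))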
