How do I write to BigQuery a schema computed during execution of the same Dataflow pipeline?


Question

My scenario is a variation on the one discussed here: How do I write to BigQuery using a schema computed during Dataflow execution?

In this case, the goal is the same (read a schema during execution, then write a table with that schema to BigQuery), but I want to accomplish it within a single pipeline.

For example, I'd like to write a CSV file to BigQuery and avoid fetching the file twice (once to read schema, once to read data).

Is this possible? If so, what's the best approach?

My current best guess is to read the schema into a PCollection via a side output and then use that to create the table (with a custom PTransform) before passing the data to BigQueryIO.Write.

Answer

If you use BigQueryIO.Write to create the table, then the schema needs to be known when the table is created.

Your proposed solution of not specifying the schema when you create the BigQueryIO.Write transform might work, but you might get an error because the table doesn't exist and you aren't configuring BigQueryIO.Write to create it if needed.

You might want to consider reading just enough of your CSV files in your main program to determine the schema before running your pipeline. This would avoid the complexity of determining the schema at runtime. You would still incur the cost of the extra read but hopefully that's minimal.
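For example, the schema could be derived in the main program by reading only the header row and a sample data row before the pipeline is constructed. A minimal sketch using Python's standard `csv` module; the type-guessing heuristic and the `infer_bigquery_schema` helper are illustrative assumptions, not part of the original answer:

```python
import csv
import io

def infer_bigquery_schema(csv_text, sample_rows=1):
    """Infer a BigQuery-style schema from a CSV header plus sample rows.

    Returns a list of {"name": ..., "type": ...} dicts, the shape used by
    the BigQuery API's tableSchema.fields. Type guessing is a simple
    INTEGER/FLOAT/STRING heuristic for illustration only.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    samples = [row for _, row in zip(range(sample_rows), reader)]

    def guess_type(values):
        if not values:
            return "STRING"  # no data rows observed; fall back to STRING
        for v in values:
            try:
                int(v)
            except ValueError:
                break
        else:
            return "INTEGER"  # every sample parsed as an int
        for v in values:
            try:
                float(v)
            except ValueError:
                return "STRING"
        return "FLOAT"

    return [
        {"name": name, "type": guess_type([row[i] for row in samples])}
        for i, name in enumerate(header)
    ]

schema = infer_bigquery_schema("id,price,label\n1,2.5,foo\n")
# schema → [{"name": "id", "type": "INTEGER"},
#           {"name": "price", "type": "FLOAT"},
#           {"name": "label", "type": "STRING"}]
```

The resulting field list could then be handed to BigQueryIO.Write (or, in the Python SDK, `WriteToBigQuery`) when the pipeline is built, so only the header and a few sample lines are read twice rather than the whole file.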

Alternatively, you could create a custom sink to write your data to BigQuery. Your sink could write the data to GCS, and its finalize method could then create a BigQuery load job. The custom sink could infer the schema by looking at the records and create the BigQuery table with the appropriate schema.
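The finalize step of such a sink might assemble a load-job request like the following. This is a sketch of the JSON body for BigQuery's jobs.insert API, assuming the records were observed as Python dicts; the `build_load_job_config` helper and the project/dataset identifiers are placeholders:

```python
def build_load_job_config(gcs_uris, records, project, dataset, table):
    """Build a BigQuery jobs.insert configuration dict for loading CSV
    files from GCS, inferring the table schema from observed records."""
    type_map = {int: "INTEGER", float: "FLOAT", bool: "BOOLEAN", str: "STRING"}
    # Infer each field's type from the first observed record; a real sink
    # would look at more records and reconcile conflicting types.
    fields = [
        {"name": name, "type": type_map.get(type(value), "STRING")}
        for name, value in records[0].items()
    ]
    return {
        "configuration": {
            "load": {
                "sourceUris": gcs_uris,
                "sourceFormat": "CSV",
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                "schema": {"fields": fields},
                # Ask BigQuery to create the table with this schema.
                "createDisposition": "CREATE_IF_NEEDED",
            }
        }
    }
```

Because the load job carries `createDisposition: CREATE_IF_NEEDED` together with the inferred schema, BigQuery creates the table at load time, so the schema never has to be known when the pipeline is constructed.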
