写表后Apache Beam Pipeline查询表 [英] Apache Beam Pipeline Query Table After Writing Table

查看:17
本文介绍了写表后Apache Beam Pipeline查询表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个将结果写入 BigQuery 表的 Apache Beam/Dataflow 管道.然后我想查询此表以获取管道的单独部分.但是,我似乎无法弄清楚如何正确设置此管道依赖项.我编写(然后想要查询)的新表与用于某些过滤逻辑的单独表相连,这就是为什么我实际上需要编写表然后运行查询的原因.逻辑如下:

I have a Apache Beam/Dataflow pipeline that is writing results to a BigQuery table. I would then like to query this table for a separate portion of the pipeline. However, I can't seem to figure out how to properly set up this pipeline dependency. The new table that I write (and then want to query) is left joined with a separate table for some filtering logic and that is why I actually need to write the table and then run the query. The logic would be as follows:

with beam.Pipeline(options=pipeline_options) as p:
    table_data = p | 'CreatTable' >> # ... logic to generate table ...

    # Write Table to BQ
    table_written = table_data | 'WriteTempTrainDataBQ' >> beam.io.WriteToBigQuery(...)

    query_results = table_written | 'QueryNewTable' >> beam.io.Read(beam.io.BigQuerySource(query=query_new_table))

如果 query_new_table 实际上是对已经存在的 BQ 表的查询,并且我更改为 query_results = p | 而不是 table_written 这可以正常工作.但是,如果我尝试查询我在管道中间写入的表,那么在实际生成该表之前,我无法让管道步骤等待".有没有办法做到这一点,我忽略了?

if query_new_table is actually a query of an already existing BQ table and I change to query_results = p | instead of table_written this works properly. However, if I try to query the table that I am writing in the middle of the pipeline then I cannot get the pipeline step to "wait" until that table has actually been generated. Is there any way to do this that I am overlooking?

当我尝试按顺序执行此步骤时,出现断言错误 assert isinstance(pbegin, pvalue.PBegin) AssertionError 我正在阅读它的意思是 table_written是问题,因为它不是有效的 PCollection 实例.

When I try to make this step sequential, I am getting an assertion error assert isinstance(pbegin, pvalue.PBegin) AssertionError which I am reading to mean that table_written is the issue as it is not a valid PCollection instance.

有人知道我可以用什么来代替 table_written 以使其实际按需要顺序运行吗?

Does anybody know what I would could put in place of table_written to make this actually run sequentially as desired?

推荐答案

Beam 目前不支持用例在 BigQuery 写入完成后做某事".唯一的解决方法是运行单独的管道:让您的主程序:运行写入 BigQuery 的管道;等待管道完成;运行另一个从 BigQuery 读取数据的管道.

The use case "do something after a BigQuery write is complete" is not supported by Beam currently. The only workaround is to run separate pipelines: have your main program be: run the pipeline that writes to BigQuery; wait for the pipeline to finish; run another pipeline that reads from BigQuery.

这是一个非常频繁请求的功能,我们正在开始设计此支持(更一般地说,升级各种 IO 写入以支持它们之后的排序操作),但我不知道我们什么时候会完成.

This is a very frequently requested feature and we're beginning to design this support (more generally, upgrading various IO writes to support sequencing operations after them), but I don't know when we'll be done.

这篇关于写表后Apache Beam Pipeline查询表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆