Apache Beam Pipeline Query Table After Writing Table

Views: 17 Published: 2021/11/11 22:32:11 Tags: python google-cloud-dataflow apache-beam

Problem Description

I have an Apache Beam/Dataflow pipeline that writes results to a BigQuery table. I would then like to query this table in a separate portion of the pipeline. However, I can't figure out how to set up this pipeline dependency properly. The new table that I write (and then want to query) is left joined with a separate table for some filtering logic, which is why I actually need to write the table and then run the query. The logic would be as follows:

import apache_beam as beam

with beam.Pipeline(options=pipeline_options) as p:
    # Build the rows that will become the new table
    table_data = p | 'CreateTable' >> ...  # ... logic to generate table ...

    # Write the table to BigQuery
    table_written = table_data | 'WriteTempTrainDataBQ' >> beam.io.WriteToBigQuery(...)

    # Query the table that was just written (this is the step that fails)
    query_results = table_written | 'QueryNewTable' >> beam.io.Read(beam.io.BigQuerySource(query=query_new_table))

If query_new_table is actually a query against an already existing BQ table and I change the step to start with query_results = p | instead of table_written, this works properly. However, if I try to query the table that I am writing in the middle of the pipeline, I cannot get the pipeline step to "wait" until that table has actually been generated. Is there a way to do this that I am overlooking?

When I try to make this step sequential, I get an assertion error, assert isinstance(pbegin, pvalue.PBegin) AssertionError, which I read to mean that table_written is the issue because it is not a valid PCollection instance.
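
The assertion quoted above can be read as a type check on the input of the read step: a read-style transform expects to be applied to the start of the pipeline (a PBegin), which is what p | ... supplies, not to the output of an earlier step. The following is only a toy model of that check. Every class in it (PValue, PBegin, PCollection, Pipeline, ReadFromQuery) is a simplified stand-in written for illustration, not Apache Beam's real implementation, and the query string is a placeholder.

```python
# Toy model only: these classes are simplified stand-ins for illustration,
# NOT Apache Beam's real classes.

class PValue:
    """Anything a transform can be applied to with the | operator."""

    def __or__(self, transform):
        return transform.expand(self)


class PBegin(PValue):
    """Marks the start of a pipeline."""


class PCollection(PValue):
    """Output of an earlier pipeline step."""


class Pipeline:
    def __or__(self, transform):
        # Applying a transform directly to the pipeline hands it a PBegin.
        return PBegin() | transform


class ReadFromQuery:
    """Hypothetical root transform modelled on the assertion in the traceback."""

    def __init__(self, query):
        self.query = query

    def expand(self, pbegin):
        # Same shape of check as the one reported in the error message.
        assert isinstance(pbegin, PBegin)
        return PCollection()


p = Pipeline()

# Works: the read is applied to the pipeline itself, so it receives a PBegin.
ok = p | ReadFromQuery("SELECT 1")

# Fails: the read is applied to the output of an earlier step instead.
table_written = PCollection()
failed = False
try:
    table_written | ReadFromQuery("SELECT 1")
except AssertionError:
    failed = True
```

In this model the failure does not depend on whether the table exists yet; it comes purely from what the read step is chained onto, which matches the observation that the p | variant runs while the table_written | variant does not.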

Does anybody know what I could put in place of table_written to make this actually run sequentially as desired?
