写入表后的Apache Beam Pipeline查询表 [英] Apache Beam Pipeline Query Table After Writing Table

查看：75 发布时间：2020/9/3 4:53:32 python google-cloud-dataflow apache-beam

本文介绍了写入表后的Apache Beam Pipeline查询表的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个Apache Beam/Dataflow管道，正在将结果写入BigQuery表.然后，我想在此表中查询管道的单独部分.但是，我似乎无法弄清楚如何正确设置此管道依赖项.我编写(然后要查询)的新表与单独的表结合在一起以进行某些过滤逻辑，这就是为什么我实际上需要编写表然后运行查询的原因.逻辑如下:

I have a Apache Beam/Dataflow pipeline that is writing results to a BigQuery table. I would then like to query this table for a separate portion of the pipeline. However, I can't seem to figure out how to properly set up this pipeline dependency. The new table that I write (and then want to query) is left joined with a separate table for some filtering logic and that is why I actually need to write the table and then run the query. The logic would be as follows:

with beam.Pipeline(options=pipeline_options) as p:
    table_data = p | 'CreatTable' >> # ... logic to generate table ...

    # Write Table to BQ
    table_written = table_data | 'WriteTempTrainDataBQ' >> beam.io.WriteToBigQuery(...)

    query_results = table_written | 'QueryNewTable' >> beam.io.Read(beam.io.BigQuerySource(query=query_new_table))

如果query_new_table实际上是对已经存在的BQ表的查询，并且我更改为query_results = p |而不是table_written，则此操作正常.但是，如果我尝试查询要在管道中间写入的表，那么在实际生成该表之前，我无法使管道步骤等待".有什么办法可以做我所忽略的吗?

if query_new_table is actually a query of an already existing BQ table and I change to query_results = p | instead of table_written this works properly. However, if I try to query the table that I am writing in the middle of the pipeline then I cannot get the pipeline step to "wait" until that table has actually been generated. Is there any way to do this that I am overlooking?

当我尝试按顺序执行此步骤时，我得到一个断言错误assert isinstance(pbegin, pvalue.PBegin) AssertionError，由于不是有效的PCollection实例，我正在读它以表示table_written是问题.

When I try to make this step sequential, I am getting an assertion error assert isinstance(pbegin, pvalue.PBegin) AssertionError which I am reading to mean that table_written is the issue as it is not a valid PCollection instance.

有人知道我可以代替table_write使其真正按需顺序运行吗?

Does anybody know what I would could put in place of table_written to make this actually run sequentially as desired?

写入表后的Apache Beam Pipeline查询表 [英] Apache Beam Pipeline Query Table After Writing Table

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

写入表后的Apache Beam Pipeline查询表 [英] Apache Beam Pipeline Query Table After Writing Table

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭