写入表后的Apache Beam Pipeline查询表 [英] Apache Beam Pipeline Query Table After Writing Table

查看:75
本文介绍了写入表后的Apache Beam Pipeline查询表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个Apache Beam/Dataflow管道,正在将结果写入BigQuery表.然后,我想在此表中查询管道的单独部分.但是,我似乎无法弄清楚如何正确设置此管道依赖项.我编写(然后要查询)的新表与单独的表结合在一起以进行某些过滤逻辑,这就是为什么我实际上需要编写表然后运行查询的原因.逻辑如下:

I have a Apache Beam/Dataflow pipeline that is writing results to a BigQuery table. I would then like to query this table for a separate portion of the pipeline. However, I can't seem to figure out how to properly set up this pipeline dependency. The new table that I write (and then want to query) is left joined with a separate table for some filtering logic and that is why I actually need to write the table and then run the query. The logic would be as follows:

with beam.Pipeline(options=pipeline_options) as p:
    table_data = p | 'CreatTable' >> # ... logic to generate table ...

    # Write Table to BQ
    table_written = table_data | 'WriteTempTrainDataBQ' >> beam.io.WriteToBigQuery(...)

    query_results = table_written | 'QueryNewTable' >> beam.io.Read(beam.io.BigQuerySource(query=query_new_table))

如果query_new_table实际上是对已经存在的BQ表的查询,并且我更改为query_results = p |而不是table_written,则此操作正常.但是,如果我尝试查询要在管道中间写入的表,那么在实际生成该表之前,我无法使管道步骤等待".有什么办法可以做我所忽略的吗?

if query_new_table is actually a query of an already existing BQ table and I change to query_results = p | instead of table_written this works properly. However, if I try to query the table that I am writing in the middle of the pipeline then I cannot get the pipeline step to "wait" until that table has actually been generated. Is there any way to do this that I am overlooking?

当我尝试按顺序执行此步骤时,我得到一个断言错误assert isinstance(pbegin, pvalue.PBegin) AssertionError,由于不是有效的PCollection实例,我正在读它以表示table_written是问题.

When I try to make this step sequential, I am getting an assertion error assert isinstance(pbegin, pvalue.PBegin) AssertionError which I am reading to mean that table_written is the issue as it is not a valid PCollection instance.

有人知道我可以代替table_write使其真正按需顺序运行吗?

Does anybody know what I would could put in place of table_written to make this actually run sequentially as desired?

推荐答案

Beam目前不支持用例在BigQuery写入完成后执行某些操作".唯一的解决方法是运行单独的管道:让您的主程序为:运行写入BigQuery的管道;等待管道完成;运行另一个从BigQuery读取的管道.

The use case "do something after a BigQuery write is complete" is not supported by Beam currently. The only workaround is to run separate pipelines: have your main program be: run the pipeline that writes to BigQuery; wait for the pipeline to finish; run another pipeline that reads from BigQuery.

这是一个非常经常需要的功能,我们开始设计这种支持(更普遍的是,升级各种IO写入以支持它们之后的排序操作),但是我不知道何时完成.

This is a very frequently requested feature and we're beginning to design this support (more generally, upgrading various IO writes to support sequencing operations after them), but I don't know when we'll be done.

这篇关于写入表后的Apache Beam Pipeline查询表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆