Dataflow job uses same BigQuery job ID when deploying using a staged template multiple times?


Question


I am attempting to deploy a Dataflow job that reads from BigQuery and writes to Cassandra on a fixed schedule. The template code is written in Java using Apache Beam and the Dataflow library. I have staged the template to Google Cloud Storage, and have configured a Cloud Scheduler instance as well as a Cloud Function used to trigger the Dataflow template. I am using the latest versions of all Beam and BigQuery dependencies.


However, I have discovered that when deploying a job using the same staged template, the BigQuery extract job always seems to use the same job ID, which causes a 409 Conflict failure in the logs. The BigQuery query job seems to succeed because the query job ID has a unique suffix appended, while the extract job ID reuses the same prefix without any suffix.


I have considered two alternative solutions: either using a crontab on a Compute Engine instance to deploy the pipeline directly, or adapting a Cloud Function to perform the same tasks as the Dataflow pipeline on a schedule. Ideally, if there is a way to change the extract job ID in the Dataflow job, that would be a much simpler solution, but I'm not sure whether this is possible. If it is not, is there a more optimal alternative?

Answer


Based on the additional description, it sounds like this may be a case of not using withTemplateCompatibility() as directed?

Use with templates


When using read() or readTableRows() in a template, it's required to specify BigQueryIO.Read.withTemplateCompatibility(). Specifying this in a non-template pipeline is not recommended because it has somewhat lower performance.
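A minimal sketch of what this looks like in a templated pipeline: chaining withTemplateCompatibility() onto the BigQueryIO read so that the read re-resolves its BigQuery jobs each time the staged template is launched, rather than reusing IDs fixed at template-staging time. The project, dataset, and table names here are hypothetical placeholders, and the downstream Cassandra write is elided.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class TemplateCompatibleRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<TableRow> rows =
        p.apply("ReadFromBigQuery",
            BigQueryIO.readTableRows()
                // Hypothetical query; substitute your own source table.
                .fromQuery("SELECT * FROM `my_project.my_dataset.my_table`")
                .usingStandardSql()
                // Required for reads inside a template: makes the read
                // template-compatible so each launch of the staged template
                // runs fresh BigQuery query/extract jobs instead of reusing
                // the job ID captured when the template was constructed.
                .withTemplateCompatibility());

    // ... downstream transforms (e.g. the write to Cassandra) go here ...
    p.run();
  }
}
```

With this in place, each launch of the staged template should generate distinct BigQuery job IDs, avoiding the 409 Conflict on the extract job.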

