How to specify insertId when streaming inserts to BigQuery using Apache Beam

Question

BigQuery supports de-duplication for streaming inserts. How can I use this feature from Apache Beam?

https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency

To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis. You might have to retry an insert because there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery. If you retry an insert, use the same insertId for the same set of rows so that BigQuery can attempt to de-duplicate your data. For more information, see troubleshooting streaming inserts.
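
The retry guidance quoted above amounts to making the ID a function of the row rather than of the attempt. Below is a minimal sketch of one way to do that, assuming rows are represented as a `Map<String, Object>` and that two rows with identical content really are the same row. `InsertIdSketch` and `insertIdFor` are hypothetical names, not part of Beam or the BigQuery client library.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: derives a stable insertId from a row's content. */
final class InsertIdSketch {

    /** Returns the same ID for the same row content, regardless of key order. */
    static String insertIdFor(Map<String, Object> row) {
        // Sort keys so that logically equal rows serialize identically.
        // Assumption: keys and values do not contain '=' or '\n'; a real
        // implementation would need an unambiguous encoding.
        StringBuilder canonical = new StringBuilder();
        for (Map.Entry<String, Object> e : new TreeMap<>(row).entrySet()) {
            canonical.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("user", "alice");
        row.put("amount", 42);
        // A first attempt and a retry of the same row yield the same ID.
        String firstAttempt = insertIdFor(row);
        String retry = insertIdFor(row);
        System.out.println(firstAttempt.equals(retry)); // prints true
    }
}
```

If the data already carries a natural unique key (an event ID, for example), using that key directly is simpler and avoids collapsing distinct rows that happen to have equal content.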

I cannot find such a feature in the Java docs: https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html

In this question, the answerer suggests setting insertId in the TableRow. Is this correct?

https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/TableRow.html?is-external=true

The BigQuery client library has this feature.

https://googleapis.github.io/google-cloud-java/google-cloud-clients/apidocs/index.html?com/google/cloud/bigquery/package-summary.html

https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-bigquery/src/main/java/com/google/cloud/bigquery/InsertAllRequest.java#L134

Recommended answer

As Felipe mentioned in the comments, it seems that Dataflow already uses insertId internally to implement "exactly once" delivery, so we cannot specify insertId manually.
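
To illustrate why a pipeline-owned ID gives this behaviour, here is a toy model, not Beam's or BigQuery's actual code: each row is tagged with a random ID once, before any write attempt, and every retry reuses that tag. `ExactlyOnceSketch`, `TaggedRow`, and `DedupingSink` are hypothetical names, and the sink here remembers IDs forever, whereas BigQuery only de-duplicates on a best-effort basis within a limited window.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

/** Toy model of "tag once, retry with the same ID". All names are hypothetical. */
final class ExactlyOnceSketch {

    /** A row paired with the ID assigned to it once, before any write attempt. */
    static final class TaggedRow {
        final String insertId;
        final String payload;

        TaggedRow(String insertId, String payload) {
            this.insertId = insertId;
            this.payload = payload;
        }
    }

    /** Toy sink that drops rows whose insertId it has already seen. */
    static final class DedupingSink {
        private final Set<String> seenIds = new HashSet<>();
        final List<String> stored = new ArrayList<>();

        void insert(TaggedRow row) {
            // Set.add returns false when the ID is already present,
            // so a retried row is silently ignored.
            if (seenIds.add(row.insertId)) {
                stored.add(row.payload);
            }
        }
    }

    /** Assigns a fresh random ID; called once per row, never per attempt. */
    static TaggedRow tag(String payload) {
        return new TaggedRow(UUID.randomUUID().toString(), payload);
    }

    public static void main(String[] args) {
        DedupingSink sink = new DedupingSink();
        TaggedRow row = tag("alice,42");
        sink.insert(row); // first attempt
        sink.insert(row); // retry after an ambiguous failure reuses the same tag
        System.out.println(sink.stored.size()); // prints 1
    }
}
```

Because the ID in this model is owned by the writer, a user-supplied ID would have nowhere to go, which is consistent with the answer above.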
