How to specify insertId when streaming inserts to BigQuery using Apache Beam


Question

BigQuery supports de-duplication for streaming inserts. How can I use this feature from Apache Beam? The BigQuery documentation describes it as follows:

https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency


To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis. You might have to retry an insert because there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery. If you retry an insert, use the same insertId for the same set of rows so that BigQuery can attempt to de-duplicate your data. For more information, see troubleshooting streaming inserts.
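
The quoted rule, that a retry must reuse the same insertId for the same row, can be illustrated with a minimal sketch. It assumes each row has a stable business key. The class `InsertIds`, the method `forRow`, and the sample table and key names are hypothetical and come from neither Beam nor the BigQuery client library. The sketch only derives the ID; it does not send anything to BigQuery.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper, not part of Beam or the BigQuery client library.
final class InsertIds {

    // Derives a deterministic ID from the table name and a stable business key,
    // so that a retry of the same row produces the same insertId.
    static String forRow(String table, String businessKey) {
        String seed = table + ":" + businessKey;
        // nameUUIDFromBytes is a name-based UUID: equal input bytes give an equal UUID.
        return UUID.nameUUIDFromBytes(seed.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String firstAttempt = forRow("events", "order-42");
        String retry = forRow("events", "order-42");
        String otherRow = forRow("events", "order-43");

        // The retry reuses the ID of the first attempt; a different row gets a different ID.
        System.out.println("retry reuses id: " + firstAttempt.equals(retry));
        System.out.println("other row shares id: " + firstAttempt.equals(otherRow));
    }
}
```

A random UUID generated on every attempt would not satisfy the rule, because the retry would carry a new ID and BigQuery could not match it to the first attempt.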

I cannot find such a feature in the Javadoc for BigQueryIO.Write:
https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html

Another question suggests setting insertId in the TableRow. Is this correct?

https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/TableRow.html?is-external=true

The BigQuery client library has this feature:

https://googleapis.github.io/google-cloud-java/google-cloud-clients/apidocs/index.html?com/google/cloud/bigquery/package-summary.html
https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-bigquery/src/main/java/com/google/cloud/bigquery/InsertAllRequest.java#L134

Recommended answer

As Felipe mentioned in the comments, it seems that Dataflow already uses insertId itself to implement "exactly once" delivery, so we cannot specify insertId manually.
