Apache Beam -> BigQuery - insertId for deduplication not working


Problem description

I am streaming data from Kafka to BigQuery using Apache Beam with the Google Dataflow runner. I wanted to make use of insertId for deduplication, which I found described in the Google docs. But even though the inserts happen within a few seconds of each other, I still see a lot of rows with the same insertId. Now I'm wondering whether I am using the API incorrectly and failing to take advantage of the deduplication mechanism that BQ offers for streaming inserts.
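For context, the read side of the pipeline isn't shown in the question; a rough sketch, where the broker address, topic name, and FxPayment.fromJson parser are all assumptions for illustration, could be:

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.kafka.common.serialization.StringDeserializer;

// Read payment events from Kafka and parse them into FxPayment objects.
PCollection<FxPayment> payments = pipeline
        .apply("Read Fx Payments from Kafka", KafkaIO.<String, String>read()
                .withBootstrapServers("kafka-broker:9092")   // hypothetical broker
                .withTopic("fx-payments")                    // hypothetical topic
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())                          // drop Kafka metadata, keep KV<String, String>
        .apply("Parse Fx Payments", MapElements
                .into(TypeDescriptor.of(FxPayment.class))
                .via(kv -> FxPayment.fromJson(kv.getValue())));  // hypothetical parser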

The code for my write looks like this:

// Stream FxPayment records into an existing BigQuery table (CREATE_NEVER
// assumes the table was created out of band).
payments.apply("Write Fx Payments to BQ", BigQueryIO.<FxPayment>write()
            .withFormatFunction(ps -> FxTableRowConverter.convertFxPaymentToTableRow(ps))
            .to(bqTradePaymentTable)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

Besides all the other fields, I set insertId directly on the TableRow in the FxTableRowConverter.convertFxPaymentToTableRow method, which is passed to BigQueryIO as the format function:

row.set("insertId", insertId);
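The converter method itself isn't shown in the question; a minimal sketch of what it might look like, with hypothetical column names and FxPayment accessors, is:

import com.google.api.services.bigquery.model.TableRow;

public class FxTableRowConverter {
    // Sketch only: column names and FxPayment getters are hypothetical.
    public static TableRow convertFxPaymentToTableRow(FxPayment payment) {
        TableRow row = new TableRow();
        row.set("paymentId", payment.getPaymentId());
        row.set("amount", payment.getAmount());
        // Intended as the dedup key, but as the answer below explains,
        // BigQueryIO treats this as an ordinary column, not as the
        // streaming insertId.
        row.set("insertId", payment.getPaymentId());
        return row;
    }
}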

I also added that field as a column in BQ. Without it, the inserts were failing (obviously). I couldn't find any other way to set insertId on BigQueryIO than adding it to the TableRow object.

Is this the correct way of using it? Because it does not work for me: I am seeing many duplicates even though I shouldn't, since, as I already mentioned, the inserts happen within seconds of each other. The BigQuery docs state that the streaming buffer keeps the insertId for at least one minute.

Recommended answer

You can't manually specify an insertId for BigQuery streaming inserts in Dataflow; see https://stackoverflow.com/a/54193825/1580227. Beam's BigQueryIO generates its own insertId for each row when it calls the streaming API, so a TableRow field named insertId is written as an ordinary column and plays no part in deduplication, which is why the duplicates show up.
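Since the connector generates insertIds internally, one workaround is to deduplicate inside the pipeline before the write. A sketch using Beam's built-in Deduplicate transform, keyed on a hypothetical business identifier (getPaymentId()), might look like:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.Deduplicate;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Drop payments whose key was already seen within the last 10 minutes.
PCollection<FxPayment> deduped = payments.apply("Deduplicate Fx Payments",
        Deduplicate.<FxPayment, String>withRepresentativeValueFn(p -> p.getPaymentId())
                .withRepresentativeCoder(StringUtf8Coder.of())
                .withDuration(Duration.standardMinutes(10)));

Unlike BigQuery's best-effort one-minute window, the deduplication window here is under your control; the trade-off is the per-key state the runner has to keep for that duration.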
