Ignore duplicate records while appending in BigQuery

Problem description

We are writing data from MySQL to BigQuery. We have set some indicators, such as:

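  • Insert: if the record is being added for the first time, save it with 'I' in the Indicator field.
  • Update: if the record contains updated data, save it with 'U' in the Indicator field; if nothing has changed, ignore the duplicate record.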

But in the 'Update' case it also writes duplicate records that have not changed at all. Below is the query we currently use to insert data into the BigQuery table. What changes can we make to this query?

"insert into `actual_table` 

(
    Id,
   ...
)
select
temp.Id,
...
case when actual.Id is null then 'I'
when actual.Id is not null and actual.field1<>temp.field1 then 'U'
end as Indicator,
FROM `temp_table` temp 
left outer join `actual_table` actual
on temp.Id= actual.Id"

The actual table is the table in BigQuery, whereas the temp table is the staging table in BigQuery. Every time we read data from MySQL, we store it in the temp table.

Thanks

Recommended answer

Another option I like with BigQuery is doing the inserts with the MERGE DML statement. It's quite a neat solution if it suits your use case. You can see more details in this link.

SQL example:

MERGE
    `mytable` as tgt
USING
    `mytable` as src
ON FALSE
WHEN NOT MATCHED AND src._PARTITIONTIME = '2019-02-21'
THEN INSERT (_PARTITIONTIME, fields...) VALUES (_PARTITIONTIME, fields...)
WHEN NOT MATCHED BY SOURCE AND tgt._PARTITIONTIME = '2019-02-21'
THEN DELETE
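
The example above uses ON FALSE to replace one partition wholesale. For the tables in the question, a MERGE keyed on Id might look like the sketch below. It is only a sketch under assumptions: Id is taken to identify at most one row in each table, field1 stands in for whatever columns are compared (the question elides the rest with "..."), and changed rows are updated in place rather than appended as new rows, which differs from the original INSERT.

```sql
-- Sketch only. Assumes Id is unique in both tables and that field1 is the
-- only column compared for changes, as in the question's query.
MERGE `actual_table` AS actual
USING `temp_table` AS temp
ON actual.Id = temp.Id
-- Id not yet in the actual table: insert it and mark it 'I'.
WHEN NOT MATCHED THEN
  INSERT (Id, field1, Indicator)
  VALUES (temp.Id, temp.field1, 'I')
-- Id already present and the data differs: overwrite it and mark it 'U'.
-- IS DISTINCT FROM also treats NULL vs. non-NULL as a change, unlike <>.
WHEN MATCHED AND actual.field1 IS DISTINCT FROM temp.field1 THEN
  UPDATE SET field1 = temp.field1, Indicator = 'U'
-- Id already present with identical data: no clause applies, so the row
-- is left untouched and no duplicate is written.
```

If the staging table can hold several rows for the same Id, BigQuery rejects the MERGE because a target row would match more than one source row, so the source would need to be deduplicated first (for example with a subquery in the USING clause).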
