Deduplicate rows in a BigQuery partition

Question

I have a table with many duplicated rows, but I only want to deduplicate rows one partition at a time.

How can I do this?

As an example, you can start with a table partitioned by date and filled with small random integers (generated below with fhoffa.x.random_int(0,5)), 100 rows for each of two dates:

CREATE OR REPLACE TABLE `temp.many_random`
PARTITION BY d
AS 
SELECT DATE('2018-10-01') d, fhoffa.x.random_int(0,5) random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))
UNION ALL
SELECT CURRENT_DATE() d, fhoffa.x.random_int(0,5) random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))

Recommended answer

Let's see what data we have in the existing table:

SELECT d, random_int, COUNT(*) c
FROM `temp.many_random`
GROUP BY 1, 2
ORDER BY 1,2

Lots of duplicates!

We can de-duplicate a single partition using MERGE and SELECT DISTINCT * with a query like this:

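-- Rebuild only the partition for the current date.
MERGE `temp.many_random` t
USING (
  -- Source: the distinct rows of the partition being deduplicated.
  SELECT DISTINCT *
  FROM `temp.many_random`
  WHERE d=CURRENT_DATE()
)
-- ON FALSE: no source row ever matches a target row.
ON FALSE
-- Delete every existing target row in that partition...
WHEN NOT MATCHED BY SOURCE AND d=CURRENT_DATE() THEN DELETE
-- ...and insert the distinct rows back.
WHEN NOT MATCHED BY TARGET THEN INSERT ROW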

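The end result: the partition for the current date now contains only distinct rows, while the other partition is untouched.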
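One way to check the result, as a minimal sketch that assumes the same `temp.many_random` table and that the MERGE was run for the current date: list any (d, random_int) combination that still appears more than once in that partition.

```sql
-- Sketch: look for rows that are still duplicated in today's partition.
-- An empty result means the partition holds only distinct rows.
SELECT d, random_int, COUNT(*) AS c
FROM `temp.many_random`
WHERE d = CURRENT_DATE()
GROUP BY d, random_int
HAVING COUNT(*) > 1;
```

Dropping the WHERE clause runs the same check across every partition, which should still report duplicates for the 2018-10-01 partition that the MERGE did not touch.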

We need to make sure the same date appears in the SELECT and in the line with THEN DELETE. Because of ON FALSE, no row ever matches, so the statement deletes all rows in that partition and inserts all rows returned by the SELECT DISTINCT.
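Since the date has to agree in two places, one option is to keep it in a single variable. The following is only a sketch using BigQuery scripting; `target_date` is a hypothetical variable name that is not part of the original answer, and the script has to be run as one multi-statement query.

```sql
-- Sketch only: target_date is an illustrative name, not from the original answer.
-- Declaring the date once keeps the SELECT filter and the DELETE condition in sync.
DECLARE target_date DATE DEFAULT CURRENT_DATE();

MERGE `temp.many_random` t
USING (
  SELECT DISTINCT *
  FROM `temp.many_random`
  WHERE d = target_date
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND d = target_date THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW;
```

To deduplicate a different partition, change only the DEFAULT expression, for example to DATE('2018-10-01').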

Inspired by:

To de-duplicate a whole table, see:
