Deduplicate rows in a BigQuery partition


Question

I have a table with many duplicated rows, but I only want to deduplicate rows one partition at a time.

How can I do this?

As an example, you can start with a table partitioned by date and filled with random integers from 1 to 5:

CREATE OR REPLACE TABLE `temp.many_random`
PARTITION BY d
AS 
SELECT DATE('2018-10-01') d, fhoffa.x.random_int(0,5) random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))
UNION ALL
SELECT CURRENT_DATE() d, fhoffa.x.random_int(0,5) random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))

Recommended answer

Let's see what data we have in the existing table:

SELECT d, random_int, COUNT(*) c
FROM `temp.many_random`
GROUP BY 1, 2
ORDER BY 1,2

There are lots of duplicates!
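To focus on the partition that is about to be cleaned, the same count can be restricted to one date and to values that occur more than once. This is a minimal sketch that reuses the `temp.many_random` table from the question; the `HAVING` filter is an addition for illustration, not part of the original answer.

```sql
-- List only the values that appear more than once in today's partition
SELECT d, random_int, COUNT(*) AS c
FROM `temp.many_random`
WHERE d = CURRENT_DATE()
GROUP BY d, random_int
HAVING COUNT(*) > 1
ORDER BY random_int
```

If this query returns no rows, the partition has no duplicates and there is nothing to clean up.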

We can de-duplicate a single partition using MERGE and SELECT DISTINCT * with a query like this:

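MERGE `temp.many_random` t
USING (
  -- Source: the distinct rows of the one partition being de-duplicated
  SELECT DISTINCT *
  FROM `temp.many_random`
  WHERE d=CURRENT_DATE()
)
-- ON FALSE: no target row ever matches a source row
ON FALSE
-- Delete every existing row in that partition ...
WHEN NOT MATCHED BY SOURCE AND d=CURRENT_DATE() THEN DELETE
-- ... and insert the distinct rows back
WHEN NOT MATCHED BY TARGET THEN INSERT ROW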

Afterwards, the partition for CURRENT_DATE() holds each distinct row only once, and the other partition is unchanged.
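One way to check the result is to compare, per partition, the total row count with the number of distinct values. The query below is a sketch under the assumption that `random_int` is the only non-partition column, as in the example table; the aliases `total_rows` and `distinct_values` are illustrative names.

```sql
-- After the MERGE, total_rows should equal distinct_values for today's
-- partition, while the 2018-10-01 partition still shows its duplicates.
SELECT
  d,
  COUNT(*) AS total_rows,
  COUNT(DISTINCT random_int) AS distinct_values
FROM `temp.many_random`
GROUP BY d
ORDER BY d
```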

We need to make sure the date is the same in the SELECT and in the clause with THEN DELETE. The statement deletes all rows in that partition and inserts all rows returned by the SELECT DISTINCT.
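Because the two dates must agree, it may help to write the date only once. The following is a minimal sketch that uses a BigQuery scripting variable for this; the variable name `target_date` is hypothetical and not part of the original answer.

```sql
-- Declare the partition date once so the SELECT and the DELETE
-- condition cannot drift apart.
DECLARE target_date DATE DEFAULT CURRENT_DATE();

MERGE `temp.many_random` t
USING (
  SELECT DISTINCT *
  FROM `temp.many_random`
  WHERE d = target_date
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND d = target_date THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW;
```

To clean a different partition, only the `DEFAULT` expression needs to change, for example to `DATE('2018-10-01')`.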

To de-duplicate a whole table, see:
