Deduplicate rows in a BigQuery partition
Question
I have a table with many duplicated rows, but I only want to deduplicate rows one partition at a time.
How can I do that?
As an example, you can start with a table partitioned by date and filled with random integers from 1 to 5:
CREATE OR REPLACE TABLE `temp.many_random`
PARTITION BY d
AS
SELECT DATE('2018-10-01') d, fhoffa.x.random_int(0,5) random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))
UNION ALL
SELECT CURRENT_DATE() d, fhoffa.x.random_int(0,5) random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))
Recommended answer
Let's see what data we have in the existing table:
SELECT d, random_int, COUNT(*) c
FROM `temp.many_random`
GROUP BY 1, 2
ORDER BY 1,2
There are a lot of duplicates!
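To put a number on the duplication per partition, a summary query along these lines may help. It is only a sketch: the column aliases are made up for illustration, and counting `DISTINCT random_int` stands in for counting distinct rows only because this example table has a single column besides the partitioning date.

```sql
-- Per-partition summary: total rows vs. distinct rows.
-- Assumes random_int is the only non-partition column, as in the example table.
SELECT
  d,
  COUNT(*) AS total_rows,
  COUNT(DISTINCT random_int) AS distinct_rows,
  COUNT(*) - COUNT(DISTINCT random_int) AS rows_to_remove
FROM `temp.many_random`
GROUP BY d
ORDER BY d
```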
We can de-duplicate a single partition using MERGE and SELECT DISTINCT * with a query like this:
MERGE `temp.many_random` t
USING (
SELECT DISTINCT *
FROM `temp.many_random`
WHERE d=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND d=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
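Then the end result looks like this: the CURRENT_DATE() partition holds each random_int value only once, while the 2018-10-01 partition still has all of its duplicates.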
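One way to check the result yourself is to re-run the earlier count query, keeping only the groups that still contain more than one row. This is a sketch against the example table, not part of the original answer.

```sql
-- List (partition, value) groups that still have duplicates.
-- After the MERGE above, no group from the CURRENT_DATE() partition should
-- appear; the untouched 2018-10-01 partition will still be listed.
SELECT d, random_int, COUNT(*) AS c
FROM `temp.many_random`
GROUP BY d, random_int
HAVING COUNT(*) > 1
ORDER BY d, random_int
```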
We need to make sure to have the same date in the SELECT and in the line with THEN DELETE. This will delete all rows on that partition, and insert all rows from the SELECT DISTINCT.
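Because the date has to match in both places, one option is to keep it in a single script variable. The sketch below assumes the statement is run as a BigQuery script (so that DECLARE is allowed); `target_date` is a hypothetical variable name, not something from the original answer.

```sql
-- Hypothetical variable: set it once to the partition you want to clean up.
DECLARE target_date DATE DEFAULT CURRENT_DATE();

MERGE `temp.many_random` t
USING (
  -- Distinct rows of the chosen partition only.
  SELECT DISTINCT *
  FROM `temp.many_random`
  WHERE d = target_date
)
ON FALSE
-- ON FALSE means no target row ever matches a source row:
-- every existing row in the chosen partition is deleted...
WHEN NOT MATCHED BY SOURCE AND d = target_date THEN DELETE
-- ...and every distinct row from the subquery is inserted back.
WHEN NOT MATCHED BY TARGET THEN INSERT ROW;
```

With the date defined once, the SELECT filter and the DELETE condition cannot drift apart when you move on to another partition.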
Inspired by:
To de-duplicate a whole table, see: