Deduplicate rows in a BigQuery partition
Question
I have a table with many duplicated rows, but I only want to deduplicate rows one partition at a time.
How can I do this?
As an example, you can start with a table partitioned by date and filled with small random integers, so that each partition contains many duplicate rows:
CREATE OR REPLACE TABLE `temp.many_random`
PARTITION BY d
AS
SELECT DATE('2018-10-01') d, fhoffa.x.random_int(0,5) random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))
UNION ALL
SELECT CURRENT_DATE() d, fhoffa.x.random_int(0,5) random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))
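If the `fhoffa.x.random_int` UDF is not reachable from your project, a similar test table can be sketched with the built-in RAND() function. This is only an illustrative variant, assuming the same `temp` dataset exists; the expression below draws integers from 1 to 5, which may differ slightly from the range the UDF produces.

```sql
-- Sketch: same table layout as above, but using only built-in functions.
-- RAND() returns a value in [0, 1), so FLOOR(RAND() * 5) + 1 yields 1..5.
CREATE OR REPLACE TABLE `temp.many_random`
PARTITION BY d
AS
SELECT DATE('2018-10-01') AS d, CAST(FLOOR(RAND() * 5) AS INT64) + 1 AS random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))
UNION ALL
SELECT CURRENT_DATE() AS d, CAST(FLOOR(RAND() * 5) AS INT64) + 1 AS random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))
```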
Recommended answer
Let's see what data we have in the existing table:
SELECT d, random_int, COUNT(*) c
FROM `temp.many_random`
GROUP BY 1, 2
ORDER BY 1,2
Lots of duplicates!
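To put a number on the duplication, one option is to compare the total row count with the distinct value count in each partition. This is a small sketch that assumes the table has only the two columns created above, so `COUNT(DISTINCT random_int)` within a date is the number of distinct rows.

```sql
-- Sketch: rows vs. distinct rows per partition (date).
-- duplicate_rows is how many rows a per-partition dedup would remove.
SELECT
  d,
  COUNT(*) AS total_rows,
  COUNT(DISTINCT random_int) AS distinct_rows,
  COUNT(*) - COUNT(DISTINCT random_int) AS duplicate_rows
FROM `temp.many_random`
GROUP BY d
ORDER BY d
```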
We can de-duplicate a single partition using MERGE and SELECT DISTINCT * with a query like this:
MERGE `temp.many_random` t
USING (
SELECT DISTINCT *
FROM `temp.many_random`
WHERE d=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND d=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
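After the MERGE runs, the partition for CURRENT_DATE() contains each distinct row exactly once, while the 2018-10-01 partition is left untouched.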
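A quick way to confirm this is to look for any row that still appears more than once in the de-duplicated partition. The following check is a sketch against the example table; an empty result means no duplicates remain for that date.

```sql
-- Sketch: list rows that still occur more than once in today's partition.
-- Expect no rows back after the MERGE above has run.
SELECT d, random_int, COUNT(*) AS c
FROM `temp.many_random`
WHERE d = CURRENT_DATE()
GROUP BY d, random_int
HAVING COUNT(*) > 1
```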
We need to make sure the same date appears both in the SELECT and in the line with THEN DELETE. Because ON FALSE never matches, every row in the target counts as not matched by the source and every row from the source counts as not matched by the target. The statement therefore deletes all rows in that partition and inserts all rows returned by the SELECT DISTINCT.
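Since the date has to be identical in both places, one way to avoid a mismatch is to hold it in a script variable. The version below is a sketch using BigQuery scripting; the variable name `target_date` and the aliases `t` and `s` are illustrative choices, not part of the original answer.

```sql
-- Sketch: keep the partition date in one place so the SELECT filter
-- and the DELETE condition cannot drift apart.
DECLARE target_date DATE DEFAULT CURRENT_DATE();

MERGE `temp.many_random` t
USING (
  SELECT DISTINCT *
  FROM `temp.many_random`
  WHERE d = target_date
) s
ON FALSE
-- Every target row is "not matched by source"; delete only the chosen date.
WHEN NOT MATCHED BY SOURCE AND t.d = target_date THEN DELETE
-- Every distinct source row is "not matched by target"; insert it back.
WHEN NOT MATCHED BY TARGET THEN INSERT ROW;
```

To process a different partition, change only the DEFAULT expression, for example to DATE('2018-10-01').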