Delete Oldest Duplicate Rows from a BigQuery Table
Problem description
I have a table with more than 70M rows, about 2M of which are duplicates. I want to remove the duplicates while keeping the most recent row for each id.
I found a few solutions here - link
However, those solutions only remove the duplicates; they do not retain the most recent row among each set of duplicates.
Here is another common solution:
;WITH cte AS (
  SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC, status DESC) AS RN
  FROM MainTable
)
DELETE FROM cte
WHERE RN > 1
But this form is not supported in BigQuery.
Here is a workaround that replaces the existing table with a deduplicated copy, keeping only the most recent row for each id.
CREATE OR REPLACE TABLE `MainTable` AS
SELECT
  id,
  acctId,
  appId,
  createdAt,
  startTime,
  subAcctId,
  type,
  updatedAt,
  userId
FROM (
  SELECT
    *,
    -- rows are numbered per id, newest updatedAt first;
    -- the first row among duplicates is kept, the others are removed
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) AS RN
  FROM
    `MainTable`)
WHERE
  RN = 1
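Since the helper column RN should not end up in the new table, I selected the required columns explicitly while replacing the existing table. (BigQuery Standard SQL also has SELECT * EXCEPT (RN), which drops the helper column without listing the others.)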
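For anyone who would rather not list every column, the same workaround can be sketched with SELECT * EXCEPT, which is part of BigQuery Standard SQL (not legacy SQL). This is only a sketch under the same assumptions as the query above: the table is `MainTable`, duplicates share an id, and updatedAt decides which row is the most recent. The extra tie-breaker on status is borrowed from the earlier CTE example and is optional.

```sql
-- Sketch: rebuild the table keeping one row per id (the newest by updatedAt).
-- Assumes Standard SQL; `MainTable`, id, updatedAt and status are the names
-- used in the question, so adjust them to the real dataset and schema.
CREATE OR REPLACE TABLE `MainTable` AS
SELECT
  * EXCEPT (RN)  -- drop the helper column so the schema stays unchanged
FROM (
  SELECT
    *,
    -- number rows within each id, newest first; status breaks ties
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY updatedAt DESC, status DESC
    ) AS RN
  FROM
    `MainTable`)
WHERE
  RN = 1
```

Because CREATE OR REPLACE TABLE rewrites the whole table, it is probably worth running the inner SELECT into a scratch table first and comparing row counts before replacing the original.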
Hope this helps someone. Please share if you have any better solutions.