Delete Oldest Duplicate Rows from a BigQuery Table


Problem description

I have a table with more than 70M rows, about 2M of which are duplicates. I want to clean up the duplicates, keeping only the most recent row for each id.

I found a few solutions here (link).

However, those solutions only remove the duplicates; they do not retain the most recent row among them.

Here is another common solution:

;WITH cte AS (
  SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC, status DESC) AS RN
  FROM MainTable
)
DELETE FROM cte
WHERE RN > 1

But deleting through a CTE like this is not supported in BigQuery.

Solution

Here is a workaround: it replaces the existing table with a deduplicated copy that keeps only the most recent row for each id.

CREATE OR REPLACE TABLE
  `MainTable` AS
SELECT
  id,
  acctId,
  appId,
  createdAt,
  startTime,
  subAcctId,
  type,
  updatedAt,
  userId
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC -- the first row among duplicates will be kept, other rows will be removed
      ) RN
  FROM
    `MainTable`)
WHERE
  RN = 1

Because the helper column RN should not end up in the new table, the outer query selects the required columns explicitly when replacing the existing table.
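As a possible shortcut, BigQuery standard SQL has a `SELECT * EXCEPT (...)` modifier, which can drop the helper column without listing every other column by name. The following is a minimal sketch of the same workaround using it; it assumes the table and column names from the query above and that the query runs in standard SQL (not legacy SQL). Note that if two rows with the same id share the same updatedAt, which of them is kept is not deterministic unless a tie-breaker is added to the ORDER BY.

```sql
-- Sketch: same deduplication as above, but dropping the helper
-- column with SELECT * EXCEPT instead of listing each column.
CREATE OR REPLACE TABLE
  `MainTable` AS
SELECT
  * EXCEPT (RN)  -- keep every column except the row-number helper
FROM (
  SELECT
    *,
    -- RN = 1 marks the most recently updated row for each id
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) AS RN
  FROM
    `MainTable`)
WHERE
  RN = 1
```

Before replacing the table in place, it may be worth running the inner SELECT into a separate table first and comparing row counts.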

Hope this helps someone. Please share if you have any better solutions.
