Delete Oldest Duplicate Rows from a BigQuery Table

Published: 2021/5/12 18:39:18 · Tag: google-bigquery

Problem Description

I have a table with more than 70M rows, about 2M of which are duplicates. I want to remove the duplicates while keeping the most recent row for each id.

I found a few solutions here - link

However, those solutions only remove the duplicates; they do not keep the most recent row among them.

Here is another common solution:

;WITH cte AS (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC, status DESC) AS RN
  FROM MainTable
)
DELETE FROM cte
WHERE RN > 1

But BigQuery does not support deleting from a CTE like this.

Solution

Here is a workaround: it replaces the existing table with a deduplicated copy that keeps only the most recent row for each id.

CREATE OR REPLACE TABLE
  `MainTable` AS
SELECT
  id,
  acctId,
  appId,
  createdAt,
  startTime,
  subAcctId,
  type,
  updatedAt,
  userId
FROM (
  SELECT
    *,
    -- The most recent row per id gets RN = 1 and is kept; older duplicates are dropped.
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) RN
  FROM
    `MainTable`)
WHERE
  RN = 1

The outer query lists the required columns explicitly so that the helper column RN does not end up in the replaced table.
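If listing every column is inconvenient, BigQuery's `SELECT * EXCEPT` can be used to drop the helper column instead. The following is a minimal sketch, assuming the same `MainTable`, `id`, and `updatedAt` as in the query above; it has not been run against the original data.

```sql
-- Sketch: same deduplication as above, but the helper column RN is
-- dropped with SELECT * EXCEPT instead of naming every column.
CREATE OR REPLACE TABLE
  `MainTable` AS
SELECT
  * EXCEPT (RN)
FROM (
  SELECT
    *,
    -- The most recent row per id gets RN = 1.
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedAt DESC) AS RN
  FROM
    `MainTable`)
WHERE
  RN = 1
```

If two rows share the same id and updatedAt, which one gets RN = 1 is not deterministic; adding a second ORDER BY key (for example `status DESC`, as in the CTE attempt from the question) would make the choice stable.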

Hope this helps someone. Please share if you have any better solutions.
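A quick way to check the result is to look for ids that still occur more than once. This is a sketch that assumes `id` is the column that defines a duplicate, as in the query above:

```sql
-- Sketch: list any ids that still have more than one row.
-- An empty result means no duplicates remain.
SELECT
  id,
  COUNT(*) AS row_count
FROM
  `MainTable`
GROUP BY
  id
HAVING
  COUNT(*) > 1
```

Running the same query before the replacement shows how many ids are affected.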
