BigQuery标准SQL:从表格中删除重复项 [英] BigQuery Standard SQL: Delete Duplicates from Table

查看:119
本文介绍了BigQuery标准SQL:从表格中删除重复项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用以下查询,使用标准sql从bigquery中删除重复记录.但这会引发错误

I am using below query to delete duplicates records from bigquery using standard sql. but it is throwing error

with cte as (
select * ,row_number()over (partition by CallRailCallId order by CallRailCallId) as rn
from `encoremarketingtest.EncoreMarketingTest.CallRailCall2` )

delete
 from cte
where rn>1

Query Failed
Error: Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [5:5]

有人可以在BigQuery中为我提供正确的方法吗?

Could anyone help me on the correct approach in BigQuery?

推荐答案

选项1

CREATE OR REPLACE TABLE `project.dataset.your_table` AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER(PARTITION BY CallRailCallId ORDER BY CallRailCallId) rn
  FROM `project.dataset.your_table`
) 
WHERE rn = 1 

选项#2

Option #2

CREATE OR REPLACE TABLE `project.dataset.your_table` AS
SELECT row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY CallRailCallId LIMIT 1)[OFFSET(0)] row
  FROM `project.dataset.your_table` t
  GROUP BY CallRailCallId
)   

您可能会注意到,以上使用 DDL ( CREATE TABLE)方法,在这里可以只使用一个已知的(根据您的问题)列-CallRailCallId
另外,请注意-ORDER BY CallRailCallId在这里没有实际作用,因为GROUP BY和PARTITION BY完全相同.但是,如果您更改该字段,则将控制幸存"中的确切行(在少数重复项中)(例如ORDER BY ts DESC-ts可能是下面的选项)

As you might noticed, above options using DDL(CREATE TABLE) approach and that is where it is possible to use just one known (from your question) column - CallRailCallId
Also, note - ORDER BY CallRailCallId plays no real role there because GROUP BY and PARTITION BY are by exactly same filed. But if you change the field this will control which exactly row (out of few duplicates) to "survive" (For example ORDER BY ts DESC - see below option for what ts might be)

选项3

此选项使用 DML (删除FROM),但需要使用一些额外的列作为平局

This option uses DML(DELETE FROM) but requires some extra column to be used to serve as a tie-breaker

例如,您有ts TIMESTAMP字段,并且您希望最新行(基于ts)可以保留

For example you have ts TIMESTAMP field and you want the most recent (based on ts) row to survive

DELETE FROM `project.dataset.your_table`
WHERE STRUCT(CallRailCallId, ts) NOT IN (
  SELECT AS STRUCT CallRailCallId, MAX(ts) ts
  FROM `project.dataset.your_table`
  GROUP BY CallRailCallId
  )

这篇关于BigQuery标准SQL:从表格中删除重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆