BigQuery标准SQL:从表格中删除重复项 [英] BigQuery Standard SQL: Delete Duplicates from Table
问题描述
我正在使用以下查询,使用标准sql从bigquery中删除重复记录.但这会引发错误
I am using below query to delete duplicates records from bigquery using standard sql. but it is throwing error
with cte as (
select * ,row_number()over (partition by CallRailCallId order by CallRailCallId) as rn
from `encoremarketingtest.EncoreMarketingTest.CallRailCall2` )
delete
from cte
where rn>1
Query Failed
Error: Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [5:5]
有人可以在BigQuery中为我提供正确的方法吗?
Could anyone help me on the correct approach in BigQuery?
推荐答案
选项1
CREATE OR REPLACE TABLE `project.dataset.your_table` AS
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY CallRailCallId ORDER BY CallRailCallId) rn
FROM `project.dataset.your_table`
)
WHERE rn = 1
选项#2
Option #2
CREATE OR REPLACE TABLE `project.dataset.your_table` AS
SELECT row.*
FROM (
SELECT ARRAY_AGG(t ORDER BY CallRailCallId LIMIT 1)[OFFSET(0)] row
FROM `project.dataset.your_table` t
GROUP BY CallRailCallId
)
您可能会注意到,以上使用 DDL
( CREATE TABLE)方法,在这里可以只使用一个已知的(根据您的问题)列-CallRailCallId
另外,请注意-ORDER BY CallRailCallId
在这里没有实际作用,因为GROUP BY和PARTITION BY完全相同.但是,如果您更改该字段,则将控制幸存"中的确切行(在少数重复项中)(例如ORDER BY ts DESC
-ts可能是下面的选项)
As you might noticed, above options using DDL
(CREATE TABLE) approach and that is where it is possible to use just one known (from your question) column - CallRailCallId
Also, note - ORDER BY CallRailCallId
plays no real role there because GROUP BY and PARTITION BY are by exactly same filed. But if you change the field this will control which exactly row (out of few duplicates) to "survive" (For example ORDER BY ts DESC
- see below option for what ts might be)
选项3
此选项使用 DML
(删除FROM),但需要使用一些额外的列作为平局
This option uses DML
(DELETE FROM) but requires some extra column to be used to serve as a tie-breaker
例如,您有ts
TIMESTAMP字段,并且您希望最新行(基于ts)可以保留
For example you have ts
TIMESTAMP field and you want the most recent (based on ts) row to survive
DELETE FROM `project.dataset.your_table`
WHERE STRUCT(CallRailCallId, ts) NOT IN (
SELECT AS STRUCT CallRailCallId, MAX(ts) ts
FROM `project.dataset.your_table`
GROUP BY CallRailCallId
)
这篇关于BigQuery标准SQL:从表格中删除重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!