How to remove duplicate rows by timestamp in BigQuery?
Question
I have a products table with the following schema:
id createdOn, updatedOn, stock, status
createdOn and updatedOn are TIMESTAMP columns. createdOn is the partition field.
Say this is the data I have now:
id createdOn, updatedOn, stock, status
1 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
2 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 5 12
3 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
I have an ETL that appends new rows to this table. When the ETL finishes, the same id can end up with more than one row.
For example:
id createdOn, updatedOn, stock, status
1 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
2 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 5 12
3 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
1 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
3 2018-09-14 14:14:24.305676 2018-09-15 10:00:00.000000 7 5
I want a query that runs over the table and makes sure each id has only one row: the row with MAX(updatedOn) should stay. There can be more than one row with MAX(updatedOn) per id. In that case they are guaranteed to be identical, because otherwise the updatedOn field would have been modified.
After running the query, the table should look like this:
id createdOn, updatedOn, stock, status
2 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 5 12
1 2018-09-14 14:14:24.305676 2018-09-14 14:14:24.305676 10 5
3 2018-09-14 14:14:24.305676 2018-09-15 10:00:00.000000 7 5
How can I write a query that does this efficiently?
I know it should be something like this:
DELETE FROM products
WHERE id NOT IN
(
SELECT MAX(id)
FROM products
GROUP BY id
)
However, this won't work... I don't have an auto-increment field to distinguish the rows.
How can I solve this?
Recommended answer
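Use a window function. A DELETE cannot tell fully identical rows apart, so it can only remove the stale rows, i.e. those whose updatedOn is older than the latest one for the same id. RANK() gives tied rows the same rank, so the newest row and any exact copies of it are kept:
DELETE FROM products
WHERE STRUCT(id, updatedOn) IN (
  SELECT AS STRUCT id, updatedOn
  FROM (
    SELECT id, updatedOn,
      RANK() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rnk
    FROM products
  ) t
  WHERE rnk > 1
)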
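Alternatively, instead of deleting, you can replace the table with its deduplicated contents. This also collapses fully identical rows:
-- If products is partitioned, you may need to repeat its partitioning spec
-- after the table name, e.g. PARTITION BY DATE(createdOn) (assumes daily partitions).
CREATE OR REPLACE TABLE products AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rn
  FROM products
)
WHERE rn = 1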
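The rule from the question (keep one row per id, the one with the latest updatedOn) can be sanity-checked locally before touching the real table. The sketch below is not BigQuery: it assumes Python's bundled SQLite (3.25 or newer, for window functions) as a stand-in, stores the timestamps as text, and uses a hypothetical helper named latest_rows. Only the ROW_NUMBER() logic is meant to carry over.

```python
import sqlite3

# Sample rows from the question: (id, createdOn, updatedOn, stock, status).
ROWS = [
    (1, "2018-09-14 14:14:24.305676", "2018-09-14 14:14:24.305676", 10, 5),
    (2, "2018-09-14 14:14:24.305676", "2018-09-14 14:14:24.305676", 5, 12),
    (3, "2018-09-14 14:14:24.305676", "2018-09-14 14:14:24.305676", 10, 5),
    (1, "2018-09-14 14:14:24.305676", "2018-09-14 14:14:24.305676", 10, 5),
    (3, "2018-09-14 14:14:24.305676", "2018-09-15 10:00:00.000000", 7, 5),
]

# Number the rows of each id from newest to oldest and keep only the newest.
DEDUPE_SQL = """
SELECT id, createdOn, updatedOn, stock, status
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rn
  FROM products
)
WHERE rn = 1
ORDER BY id
"""


def latest_rows(rows):
    """Return one row per id, keeping the row with the greatest updatedOn."""
    conn = sqlite3.connect(":memory:")
    try:
        # Timestamps are stored as text; this fixed-width format sorts correctly.
        conn.execute(
            "CREATE TABLE products "
            "(id INTEGER, createdOn TEXT, updatedOn TEXT, "
            "stock INTEGER, status INTEGER)"
        )
        conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", rows)
        return conn.execute(DEDUPE_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in latest_rows(ROWS):
        print(row)
```

Because identical rows tie on updatedOn, ROW_NUMBER() picks one of them arbitrarily, which is harmless here since the question guarantees such rows are identical.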