How to remove duplicated rows by timestamp in BigQuery?


Question

I have a products table with the following schema:

id, createdOn, updatedOn, stock, status

createdOn and updatedOn are of type TIMESTAMP.

createdOn is the partition field.

Say this is the data I have now:

id  createdOn                   updatedOn                   stock  status
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
2   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  5      12
3   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5

I have an ETL that appends new rows to this table. When the ETL has finished, the same id can have more than one row.

For example:

id  createdOn                   updatedOn                   stock  status
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
2   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  5      12
3   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
3   2018-09-14 14:14:24.305676  2018-09-15 10:00:00.000000  7      5

I want a query that runs over the table and makes sure that each id has only one row: the row with MAX(updatedOn) should stay. There can be more than one row with MAX(updatedOn) for an id; in that case the rows are guaranteed to be identical, because if they were not, the updatedOn field would have been modified.

After running the query the table will look like:

id  createdOn                   updatedOn                   stock  status
2   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  5      12
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
3   2018-09-14 14:14:24.305676  2018-09-15 10:00:00.000000  7      5

How can I write a query that performs this efficiently?

I know it should be something like this:

DELETE FROM products
WHERE id NOT IN
(
    SELECT MAX(id)
    FROM products
    GROUP BY id
)

However, this won't work: I don't have an auto-increment field to distinguish the rows.

How can I solve this?

Recommended answer

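A DELETE statement can remove the rows that are older than the newest updatedOn of their id:

DELETE FROM products p
WHERE p.updatedOn < (
  -- newest updatedOn recorded for this row's id
  SELECT MAX(q.updatedOn)
  FROM products q
  WHERE q.id = p.id
)

Rows that tie on MAX(updatedOn) are identical in every column, so a DELETE cannot keep one of them and drop the rest; removing those exact duplicates requires rewriting the table.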
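
Before running a destructive statement, it may help to see which ids currently have more than one row. A minimal sketch, assuming the table is reachable as products as in the question (in practice it would likely be qualified with its dataset); the column aliases row_count and latest_updatedOn are only illustrative:

```sql
-- List every id that has more than one row, together with its row count
-- and the newest updatedOn value, i.e. the version that should survive.
SELECT
  id,
  COUNT(*) AS row_count,
  MAX(updatedOn) AS latest_updatedOn
FROM products
GROUP BY id
HAVING COUNT(*) > 1
ORDER BY id;
```

Once the table has been fully deduplicated, the same query should return no rows.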

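Alternatively, instead of deleting, you can replace the table and use ROW_NUMBER to keep exactly one row per id:

CREATE OR REPLACE TABLE products
-- products is partitioned on createdOn: repeat the table's own PARTITION BY
-- clause here (for example PARTITION BY DATE(createdOn) for daily partitions)
AS
SELECT * EXCEPT(rn)
FROM (
  -- rn = 1 marks the newest row of each id; ties are identical rows
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rn
  FROM products
)
WHERE rn = 1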
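
The one-row-per-id selection can also be previewed without touching the table. The sketch below uses ARRAY_AGG with ORDER BY and LIMIT 1 rather than ROW_NUMBER; it is a swapped-in alternative, not part of the answer above, and the aliases t and latest are only illustrative:

```sql
-- For each id, collect its rows into an array ordered by updatedOn
-- (newest first), keep only the first element, and expand that struct
-- back into columns. This only reads products; nothing is modified.
SELECT latest.*
FROM (
  SELECT
    ARRAY_AGG(t ORDER BY t.updatedOn DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM products AS t
  GROUP BY t.id
);
```

If the output looks right, the same SELECT could serve as the body of a CREATE OR REPLACE TABLE statement.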
