How to remove duplicate rows by timestamp in BigQuery?

Views: 55 | Published: 2021/5/12 18:45:24 | Tags: sql, google-bigquery

Problem description

I have a products table with the following schema:

id, createdOn, updatedOn, stock, status

createdOn and updatedOn are TIMESTAMP columns.

createdOn is the partition field.

Say this is the data I have now:

id  createdOn                   updatedOn                   stock  status
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
2   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  5      12
3   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5

I have an ETL job that appends new rows to this table. When the ETL finishes, the same id can end up with more than one row.

For example:

id  createdOn                   updatedOn                   stock  status
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
2   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  5      12
3   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
3   2018-09-14 14:14:24.305676  2018-09-15 10:00:00.000000  7      5

I want a query that runs over the table and ensures that each id has only one row: the row with MAX(updatedOn) should stay. There can be more than one row with the MAX(updatedOn) for an id. In that case the rows are guaranteed to be identical, because if they were not, the updatedOn field would have been modified.

After running the query, the table should look like this:

id  createdOn                   updatedOn                   stock  status
2   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  5      12
1   2018-09-14 14:14:24.305676  2018-09-14 14:14:24.305676  10     5
3   2018-09-14 14:14:24.305676  2018-09-15 10:00:00.000000  7      5

How can I write a query that does this efficiently?

I know it should be something like this:

DELETE FROM products
WHERE id NOT IN
(
    SELECT MAX(id)
    FROM products
    GROUP BY id
)

However, this won't work... I don't have an auto-increment field to distinguish the rows.

How can I solve this?

Recommended answer

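Use a window function such as ROW_NUMBER or RANK, partitioned by id and ordered by updatedOn. With DELETE you can remove the superseded versions of each id:

-- Deletes rows that are older than the newest updatedOn for their id.
-- RANK() gives tied newest rows rank 1, so exact duplicates of the newest row are kept;
-- a DELETE matches rows by value and cannot keep just one of two identical rows.
DELETE FROM products
WHERE STRUCT(id, updatedOn) IN (
  SELECT AS STRUCT id, updatedOn
  FROM (
    SELECT id, updatedOn,
           RANK() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rnk
    FROM products
  ) t
  WHERE rnk > 1
)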
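
A DELETE matches rows by value, so it can drop superseded versions of an id but cannot keep a single copy of two identical rows. The sketch below illustrates this on the rows for ids 1 and 3 from the question. It is a local illustration only, under several assumptions: it uses Python's built-in sqlite3 module rather than BigQuery, stores the timestamps as text, and the function name delete_stale_versions is made up for this example.

```python
import sqlite3

# Rows for ids 1 and 3 from the question: id 1 has an exact duplicate,
# id 3 has an older and a newer version.
ROWS = [
    (1, "2018-09-14 14:14:24.305676", "2018-09-14 14:14:24.305676", 10, 5),
    (3, "2018-09-14 14:14:24.305676", "2018-09-14 14:14:24.305676", 10, 5),
    (1, "2018-09-14 14:14:24.305676", "2018-09-14 14:14:24.305676", 10, 5),
    (3, "2018-09-14 14:14:24.305676", "2018-09-15 10:00:00.000000", 7, 5),
]


def delete_stale_versions(rows):
    """Delete rows older than the newest updatedOn of their id; return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products "
        "(id INTEGER, createdOn TEXT, updatedOn TEXT, stock INTEGER, status INTEGER)"
    )
    conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", rows)
    # All timestamps share one fixed-width text format, so string comparison
    # orders them chronologically.
    conn.execute(
        """
        DELETE FROM products
        WHERE updatedOn < (
            SELECT MAX(p2.updatedOn) FROM products AS p2 WHERE p2.id = products.id
        )
        """
    )
    remaining = conn.execute(
        "SELECT id, updatedOn, stock, status FROM products ORDER BY id"
    ).fetchall()
    conn.close()
    return remaining


if __name__ == "__main__":
    for row in delete_stale_versions(ROWS):
        print(row)
```

The stale version of id 3 is removed, while both identical rows of id 1 remain. Collapsing exact duplicates therefore requires rewriting the table rather than deleting from it.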

Alternatively, instead of deleting rows, you can replace the table:

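-- Keeps exactly one row per id: the one with the latest updatedOn.
-- If products is partitioned, BigQuery may require a matching PARTITION BY clause after the table name.
CREATE OR REPLACE TABLE products AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rn
  FROM products
)
WHERE rn = 1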
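
To sanity-check the keep-the-latest-row-per-id logic before touching a real table, a windowed query that partitions by id, orders by updatedOn descending and keeps row number 1 can be tried on the question's sample rows. This is a sketch under assumptions: it uses Python's built-in sqlite3 module (SQLite 3.25 or later is needed for window functions) as a stand-in for BigQuery, stores the timestamps as text, and latest_row_per_id is a hypothetical helper name.

```python
import sqlite3

# The five rows from the question after the ETL run.
ROWS = [
    (1, "2018-09-14 14:14:24.305676", "2018-09-14 14:14:24.305676", 10, 5),
    (2, "2018-09-14 14:14:24.305676", "2018-09-14 14:14:24.305676", 5, 12),
    (3, "2018-09-14 14:14:24.305676", "2018-09-14 14:14:24.305676", 10, 5),
    (1, "2018-09-14 14:14:24.305676", "2018-09-14 14:14:24.305676", 10, 5),
    (3, "2018-09-14 14:14:24.305676", "2018-09-15 10:00:00.000000", 7, 5),
]

# Number the rows of each id from newest to oldest and keep only the first.
# SQLite has no SELECT * EXCEPT, so the output columns are listed explicitly.
DEDUP_SQL = """
    SELECT id, createdOn, updatedOn, stock, status
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedOn DESC) AS rn
        FROM products
    )
    WHERE rn = 1
    ORDER BY id
"""


def latest_row_per_id(rows):
    """Load rows into an in-memory table and return one row per id (latest updatedOn)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products "
        "(id INTEGER, createdOn TEXT, updatedOn TEXT, stock INTEGER, status INTEGER)"
    )
    conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(DEDUP_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_row_per_id(ROWS):
        print(row)
```

On the sample data this returns one row each for ids 1, 2 and 3, with id 3 carrying stock 7 from its newer version.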
