How to deduplicate in Presto

Problem description

I have a Presto table. Assume it has the columns [id, name, update_time] and the following data:

(1, Amy, 2018-08-01),
(1, Amy, 2018-08-02),
(1, Amyyyyyyy, 2018-08-03),
(2, Bob, 2018-08-01)

Now I want to run a SQL query whose result is:

(1, Amyyyyyyy, 2018-08-03),
(2, Bob, 2018-08-01)

Currently, my best way to deduplicate in Presto is the query below.

select 
    t1.id, 
    t1.name,
    t1.update_time 
from table_name t1
join (select id, max(update_time) as update_time from table_name group by id) t2
    on t1.id = t2.id and t1.update_time = t2.update_time

For more information, see "deduplication in SQL".

Is there a better way to deduplicate in Presto?

Solution

In PrestoDB, I would be inclined to use row_number():

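-- Number the rows within each id, newest update_time first,
-- then keep only the first row of each group.
select id, name, update_time
from (select t.*,
             row_number() over (partition by id order by update_time desc) as seqnum
      from table_name t
     ) t
where seqnum = 1;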
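
To try the row_number() approach without creating a table, here is a minimal sketch that inlines the question's sample rows with VALUES. It assumes that update_time is a DATE and that duplicates are identified by id, as the desired output in the question implies; the alias v and the final order by are only there for illustration.

    -- Sample rows from the question, inlined so the query is self-contained.
    with table_name as (
        select *
        from (values
                 (1, 'Amy',       date '2018-08-01'),
                 (1, 'Amy',       date '2018-08-02'),
                 (1, 'Amyyyyyyy', date '2018-08-03'),
                 (2, 'Bob',       date '2018-08-01')
             ) as v (id, name, update_time)
    )
    -- Keep the most recently updated row for each id.
    select id, name, update_time
    from (select t.*,
                 row_number() over (partition by id order by update_time desc) as seqnum
          from table_name t
         ) t
    where seqnum = 1
    order by id;

On this data the query should return the two rows the question asks for: (1, Amyyyyyyy, 2018-08-03) and (2, Bob, 2018-08-01). One difference from the join on max(update_time) is worth noting: if two rows share the same id and the same latest update_time, the join returns both of them, while row_number() returns exactly one, chosen arbitrarily unless a tie-breaking column is added to the order by.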
