How to deduplicate in Presto
Question
I have a Presto table. Assume it has the columns [id, name, update_time] and the following data:
(1, Amy, 2018-08-01),
(1, Amy, 2018-08-02),
(1, Amyyyyyyy, 2018-08-03),
(2, Bob, 2018-08-01)
Now I want to run a SQL query whose result is:
(1, Amyyyyyyy, 2018-08-03),
(2, Bob, 2018-08-01)
Currently, the best way I have found to deduplicate in Presto is the following:
select
    t1.id,
    t1.name,
    t1.update_time
from table_name t1
join (select id, max(update_time) as update_time from table_name group by id) t2
    on t1.id = t2.id and t1.update_time = t2.update_time
For more background, see "deduplication in SQL".
Is there a better way to deduplicate in Presto?
Solution
In PrestoDB, I would be inclined to use row_number():
select id, name, update_time
from (select t.*,
             -- number each id's rows from newest to oldest
             row_number() over (partition by id order by update_time desc) as seqnum
      from table_name t
     ) t
where seqnum = 1;
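To check the row_number() approach against the sample data from the question, here is a minimal sketch that runs it on an in-memory SQLite database. SQLite is only a stand-in for Presto and is an assumption of this example, as are the helper name `latest_rows` and the storage of `update_time` as ISO-formatted text. The query partitions by id and orders by update_time, the columns given in the question.

```python
import sqlite3

# Sample rows from the question: (id, name, update_time).
ROWS = [
    (1, "Amy", "2018-08-01"),
    (1, "Amy", "2018-08-02"),
    (1, "Amyyyyyyy", "2018-08-03"),
    (2, "Bob", "2018-08-01"),
]

# Keep only the newest row per id. The window function needs SQLite 3.25+.
DEDUP_SQL = """
select id, name, update_time
from (select t.*,
             row_number() over (partition by id order by update_time desc) as seqnum
      from table_name t
     ) t
where seqnum = 1
order by id
"""


def latest_rows(rows):
    """Load rows into a throwaway table and return the newest row per id."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table table_name (id integer, name text, update_time text)"
        )
        conn.executemany("insert into table_name values (?, ?, ?)", rows)
        return conn.execute(DEDUP_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in latest_rows(ROWS):
        print(row)
```

One difference from the join on max(update_time) is worth noting: if two rows for the same id share the latest update_time, the join returns both of them, while row_number() keeps exactly one (which one is arbitrary unless a tie-breaking column is added to the order by).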