删除postgres中的重复项 [英] Delete duplicates in postgres

查看:108
本文介绍了删除postgres中的重复项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我要删除给定重复"external_id"的除一行以外的所有内容.下面的查询需要大约两分钟的时间来运行我的5,000,000行表,而且我觉得必须有一种更快的方式来执行此任务. "id"是主键,"external_id"是btree索引列:

I want to delete all but one row for a given duplicate "external_id". The query below takes about two minutes to run for my table of 5,000,000 rows, and I feel like there's got to be a quicker way of performing this task. "id" is the primary key, and "external_id" is a btree indexed column:

delete from posts p1 using (select distinct on (1)
        external_id, id
        from posts
        order by 1 desc, 2 desc) p_recent 
    where p1.external_id = p_recent.external_id
    and p1.id != p_recent.id;

如何提高此查询的性能?

How can I improve the performance of this query?

下面的查询计划:

Delete on posts p1  (cost=2322413.28..2673548.11 rows=5583248 width=45) (actual time=148064.026..148064.026 rows=0 loops=1)
   ->  Hash Join  (cost=2322413.28..2673548.11 rows=5583248 width=45) (actual time=148064.025..148064.025 rows=0 loops=1)
         Hash Cond: ((p_recent.external_id)::text = (p1.external_id)::text)
         Join Filter: (p1.id <> p_recent.id)
         ->  Subquery Scan on p_recent  (cost=1565918.17..1649666.91 rows=5583249 width=54) (actual time=80975.573..98202.920 rows=5947083 loops=1)
               ->  Unique  (cost=1565918.17..1593834.42 rows=5583249 width=15) (actual time=80975.561..95891.264 rows=5947083 loops=1)
                     ->  Sort  (cost=1565918.17..1579876.30 rows=5583249 width=15) (actual time=80975.560..93768.105 rows=5947083 loops=1)
                           Sort Key: posts.external_id, posts.id
                           Sort Method: external merge  Disk: 153984kB
                           ->  Seq Scan on posts  (cost=0.00..653989.49 rows=5583249 width=15) (actual time=0.014..10314.089 rows=5947083 loops=1)
         ->  Hash  (cost=653989.49..653989.49 rows=5583249 width=21) (actual time=38966.573..38966.573 rows=5947083 loops=1)
               Buckets: 4096  Batches: 256  Memory Usage: 1017kB
               ->  Seq Scan on posts p1  (cost=0.00..653989.49 rows=5583249 width=21) (actual time=0.028..35863.561 rows=5947083 loops=1)
 Total runtime: 148084.796 ms

推荐答案

DELETE from posts del 
WHERE EXISTS (
        SELECT *
        FROM posts ex
        WHERE ex.external_id = del.external_id
        AND ex.id < del.id -- if you want to keep the lowest id
        -- AND ex.id > del.id -- if you want to keep the highest id
        );

这篇关于删除postgres中的重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆