Optimize Postgres deletion of orphaned records


Question

Take the following two tables for example:

Table "public.contacts"
       Column       |            Type             |                       Modifiers                       | Storage  | Stats target | Description
--------------------+-----------------------------+-------------------------------------------------------+----------+--------------+-------------
 id                 | integer                     | not null default nextval('contacts_id_seq'::regclass) | plain    |              |
 created_at         | timestamp without time zone | not null                                              | plain    |              |
 updated_at         | timestamp without time zone | not null                                              | plain    |              |
 external_id        | integer                     |                                                       | plain    |              |
 email_address      | character varying           |                                                       | extended |              |
 first_name         | character varying           |                                                       | extended |              |
 last_name          | character varying           |                                                       | extended |              |
 company            | character varying           |                                                       | extended |              |
 industry           | character varying           |                                                       | extended |              |
 country            | character varying           |                                                       | extended |              |
 region             | character varying           |                                                       | extended |              |
 ext_instance_id    | integer                     |                                                       | plain    |              |
 title              | character varying           |                                                       | extended |              |
Indexes:
    "contacts_pkey" PRIMARY KEY, btree (id)
    "index_contacts_on_ext_instance_id_and_external_id" UNIQUE, btree (ext_instance_id, external_id)

Table "public.members"
        Column         |            Type             |                             Modifiers                              | Storage  | Stats target | Description
-----------------------+-----------------------------+--------------------------------------------------------------------+----------+--------------+-------------
 id                    | integer                     | not null default nextval('members_id_seq'::regclass)               | plain    |              |
 step_id               | integer                     |                                                                    | plain    |              |
 contact_id            | integer                     |                                                                    | plain    |              |
 rule_id               | integer                     |                                                                    | plain    |              |
 request_id            | integer                     |                                                                    | plain    |              |
 sync_id               | integer                     |                                                                    | plain    |              |
 status                | integer                     | not null default 0                                                 | plain    |              |
 matched_targeted_rule | boolean                     | default false                                                      | plain    |              |
 external_fields       | jsonb                       |                                                                    | extended |              |
 imported_at           | timestamp without time zone |                                                                    | plain    |              |
 campaign_id           | integer                     |                                                                    | plain    |              |
 ext_instance_id       | integer                     |                                                                    | plain    |              |
 created_at            | timestamp without time zone |                                                                    | plain    |              |
Indexes:
    "members_pkey" PRIMARY KEY, btree (id)
    "index_members_on_contact_id_and_step_id" UNIQUE, btree (contact_id, step_id)
    "index_members_on_campaign_id" btree (campaign_id)
    "index_members_on_step_id" btree (step_id)
    "index_members_on_sync_id" btree (sync_id)
    "index_members_on_request_id" btree (request_id)
    "index_members_on_status" btree (status)

Indices exist for both primary keys and members.contact_id.

I need to delete any contact which has no related members. There are roughly 3MM contact and 25MM member records.

I'm attempting the following two queries:

DELETE FROM "contacts"
WHERE  "contacts"."id" IN (SELECT "contacts"."id" 
                           FROM   "contacts" 
                                  LEFT OUTER JOIN members 
                                               ON 
                                  members.contact_id = contacts.id 
                           WHERE  members.id IS NULL);

DELETE 0
Time: 173033.801 ms

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on contacts  (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.354..188717.354 rows=0 loops=1)
   ->  Nested Loop  (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.351..188717.351 rows=0 loops=1)
         ->  HashAggregate  (cost=2654306.36..2654306.37 rows=1 width=16) (actual time=188717.349..188717.349 rows=0 loops=1)
               Group Key: contacts_1.id
               ->  Hash Right Join  (cost=161177.46..2654306.36 rows=1 width=16) (actual time=188717.345..188717.345 rows=0 loops=1)
                     Hash Cond: (members.contact_id = contacts_1.id)
                     Filter: (members.id IS NULL)
                     Rows Removed by Filter: 26725870
                     ->  Seq Scan on members  (cost=0.00..1818698.96 rows=25322396 width=14) (actual time=0.043..160226.686 rows=26725870 loops=1)
                     ->  Hash  (cost=105460.65..105460.65 rows=3205265 width=10) (actual time=1962.612..1962.612 rows=3196180 loops=1)
                           Buckets: 262144  Batches: 4  Memory Usage: 34361kB
                           ->  Seq Scan on contacts contacts_1  (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.011..950.657 rows=3196180 loops=1)
         ->  Index Scan using contacts_pkey on contacts  (cost=0.43..1.48 rows=1 width=10) (never executed)
               Index Cond: (id = contacts_1.id)
 Planning time: 0.488 ms
 Execution time: 188718.862 ms

Query 2:

DELETE FROM contacts 
WHERE  NOT EXISTS (SELECT 1 
                   FROM   members c 
                   WHERE  c.contact_id = contacts.id); 

DELETE 0
Time: 170871.219 ms

-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on contacts  (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.034..177523.034 rows=0 loops=1)
   ->  Hash Anti Join  (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.029..177523.029 rows=0 loops=1)
         Hash Cond: (contacts.id = c.contact_id)
         ->  Seq Scan on contacts  (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.018..1068.357 rows=3196180 loops=1)
         ->  Hash  (cost=1818698.96..1818698.96 rows=25322396 width=10) (actual time=169587.802..169587.802 rows=26725870 loops=1)
               Buckets: 262144  Batches: 32  Memory Usage: 36228kB
               ->  Seq Scan on members c  (cost=0.00..1818698.96 rows=25322396 width=10) (actual time=0.052..160081.880 rows=26725870 loops=1)
 Planning time: 0.901 ms
 Execution time: 177524.526 ms

As you can see, even without deleting any records, both queries show similar performance, taking ~3 minutes.

The server disk I/O spikes to 100%, so I'm assuming that data is being spilled to disk because a sequential scan is done on both contacts and members.

The server is an EC2 r3.large (15GB RAM).

Any ideas on what I can do to optimize this query?

Update:

After running VACUUM ANALYZE on both tables and ensuring enable_mergejoin is set to on, there is no difference in the query time:

DELETE FROM contacts 
WHERE  NOT EXISTS (SELECT 1 
                   FROM   members c 
                   WHERE  c.contact_id = contacts.id); 

-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on contacts  (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.342..209406.342 rows=0 loops=1)
   ->  Hash Anti Join  (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
         Hash Cond: (contacts.id = c.contact_id)
         ->  Seq Scan on contacts  (cost=0.00..105683.28 rows=3227528 width=10) (actual time=0.008..1010.643 rows=3227462 loops=1)
         ->  Hash  (cost=1814029.74..1814029.74 rows=24855474 width=10) (actual time=198054.302..198054.302 rows=27307060 loops=1)
               Buckets: 262144  Batches: 32  Memory Usage: 37006kB
               ->  Seq Scan on members c  (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
 Planning time: 0.328 ms
 Execution time: 209408.040 ms

Update 2:

PG version:

PostgreSQL 9.4.4 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (Gentoo Hardened 4.5.4 p1.0, pie-0.4.7) 4.5.4, 64-bit

Relation sizes:

         Table         |  Size   | External Size
-----------------------+---------+---------------
 members               | 23 GB   | 11 GB
 contacts              | 944 MB  | 371 MB
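For context, a size report of this shape can be produced with the standard size functions. A sketch, assuming "Size" means the total relation size and "External Size" the part outside the heap (indexes plus TOAST):

SELECT c.relname AS "Table",
       pg_size_pretty(pg_total_relation_size(c.oid)) AS "Size",
       pg_size_pretty(pg_total_relation_size(c.oid)
                      - pg_relation_size(c.oid))     AS "External Size"
FROM   pg_class c
JOIN   pg_namespace n ON n.oid = c.relnamespace
WHERE  n.nspname = 'public'
AND    c.relname IN ('members', 'contacts');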

Settings:

 work_mem
----------
 64MB

 random_page_cost
------------------
 4


Update 3:

Experimenting with doing this in batches doesn't seem to help with I/O usage (it still spikes to 100%) and doesn't improve the runtime, despite using index-based plans.

DO $do$
BEGIN
  FOR i IN 57..668 LOOP
    DELETE FROM contacts
    WHERE contacts.id IN (
      SELECT contacts.id
      FROM   contacts
      LEFT OUTER JOIN members ON members.contact_id = contacts.id
      WHERE  members.id IS NULL
      AND    contacts.id >= (i * 10000)
      AND    contacts.id <  ((i + 1) * 10000)
    );
  END LOOP;
END $do$;

I had to kill the query after Time: 1203492.326 ms; disk I/O stayed at 100% the entire time it ran. I also experimented with chunk sizes of 1,000 and 5,000 but did not see any improvement in performance.

Note: The 57..668 range was used because I know those are existing contact IDs. (E.g. min(id) and max(id))

Answer

Any ideas on what I can do to optimize this query?

Your queries are perfect. I would use the NOT EXISTS variant.

Your index index_members_on_contact_id_and_step_id is also well suited for it.

But see below about BRIN indexes.

You can tune your server, table and index configuration.
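For example, both plans above show the hash spilling to disk in batches ("Batches: 32" with work_mem at 64MB), so one session-level experiment is to raise work_mem just for the delete. The 256MB figure below is purely illustrative, not a recommendation:

-- Session-local: allow the hash anti-join to run in fewer batches
SET work_mem = '256MB';

DELETE FROM contacts
WHERE  NOT EXISTS (SELECT 1 FROM members c WHERE c.contact_id = contacts.id);

RESET work_mem;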

Since you do not actually update or delete many rows (hardly any at all, according to your comment?), you need to optimize read performance.

You provided:

The server is an EC2 r3.large (15GB RAM).

And:

PostgreSQL 9.4.4

Your version is seriously outdated. At least upgrade to the latest minor version. Better yet, upgrade to the current major version. Postgres 9.5 and 9.6 brought major improvements for big data - which is what you need exactly.

Consider the project's versioning policy.

Amazon allows you to upgrade!

There is an unexpected 10% mismatch between expected and actual row count in the basic sequential scan:

Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)

Not dramatic at all, but it still should not occur in this query. It indicates that you may have to tune your autovacuum settings - possibly per table for the very big ones.
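A per-table override could look like the following sketch; the scale factors are placeholders to experiment with, not tested recommendations:

ALTER TABLE members SET (
  autovacuum_vacuum_scale_factor  = 0.02,  -- vacuum after ~2% of rows are dead
  autovacuum_analyze_scale_factor = 0.01   -- re-analyze after ~1% of rows change
);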

More issues:

Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)

Postgres expects to find 1875003 rows to delete, while actually 0 rows are found. That's unexpected. Maybe substantially increasing the statistics target on members.contact_id and contacts.id can help to decrease the gap, which might allow better query plans.
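As an illustration (1000 is an arbitrary example value; the default target is 100):

ALTER TABLE members  ALTER COLUMN contact_id SET STATISTICS 1000;
ALTER TABLE contacts ALTER COLUMN id         SET STATISTICS 1000;
ANALYZE members;   -- rebuild statistics with the new targets
ANALYZE contacts;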

Your ~ 25MM rows in members occupy 23 GB - that's almost 1kb per row, which seems excessive for the table definition you presented (even if the total size you provided should include indexes):

 4 bytes  item identifier
24        tuple header
 8        null bitmap
36        9x integer
16        2x ts
 1        1x bool
??        1x jsonb

That's 89 bytes per row - or less with some NULL values - and hardly any alignment padding, so 96 bytes max, plus your jsonb column.

Either that jsonb column is very big, which would make me suggest normalizing the data into separate columns or a separate table.
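A quick, sampled sanity check can show how much of each row the jsonb column accounts for. A sketch - pg_column_size reports the stored (possibly compressed) datum size, so treat the numbers as approximate:

SELECT avg(pg_column_size(m.*))               AS avg_row_bytes,
       avg(pg_column_size(m.external_fields)) AS avg_jsonb_bytes
FROM   (SELECT * FROM members LIMIT 100000) m;  -- sample to keep the scan cheap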

Or your table is bloated, which can be solved with VACUUM FULL ANALYZE or, while you are at it:

CLUSTER members USING index_members_on_contact_id_and_step_id;
VACUUM members;

But either takes an exclusive lock on the table, which you say you cannot afford. pg_repack can do it without an exclusive lock.
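pg_repack is a separate command-line tool (plus a server-side extension), not plain SQL. Assuming it is installed, an invocation along these lines rewrites members in the order of the unique index while holding only brief locks (the database name is a placeholder):

pg_repack --dbname=mydb --table=members --order-by=contact_id,step_id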

Even if we factor in index sizes, your table seems too big: you have 7 small indexes, each 36 - 44 bytes per row without bloat, less with NULL values, so < 300 bytes altogether.
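To check that estimate against reality, per-index sizes can be read from the standard statistics view:

SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname = 'members';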

Either way, consider more aggressive autovacuum settings for your table members.

And / or stop bloating the table to begin with. Are you updating rows a lot? Any particular column you update a lot? That jsonb column, maybe? You might move it to a separate (1:1) table, just to stop bloating the main table with dead tuples and keeping autovacuum from doing its job.
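A minimal sketch of such a split - the table name is invented, and a live migration would also need to handle concurrent writes:

-- 1:1 side table holding only the big jsonb column
CREATE TABLE member_external_fields (
  member_id       integer PRIMARY KEY REFERENCES members(id),
  external_fields jsonb NOT NULL
);

INSERT INTO member_external_fields (member_id, external_fields)
SELECT id, external_fields
FROM   members
WHERE  external_fields IS NOT NULL;

-- Dropping the column only marks it dead; a later VACUUM FULL or
-- pg_repack run actually reclaims the space.
ALTER TABLE members DROP COLUMN external_fields;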

Block range indexes require Postgres 9.5 or later and dramatically reduce index size. I was too optimistic in my first draft. A BRIN index is perfect for your use case if you have many rows in members for each contact.id - after physically clustering your table at least once (see the fitting CLUSTER command above). In that case Postgres can rule out whole data pages quickly. But your numbers indicate only around 8 rows per contact.id, so data pages would often contain multiple values, which voids much of the effect. It depends on the actual details of your data distribution ...

On the other hand, as it stands, your tuple size is around 1 kb, so only ~ 8 rows per data page (typically 8kb). If that isn't mostly bloat, a BRIN index might help after all.

But you need to upgrade your server version first (see the note on upgrading above).

CREATE INDEX members_contact_id_brin_idx ON members USING BRIN (contact_id);
