Speed up Postgres Update on Large Table

Problem Description

I have a query that needs to update a table with ~14 million records. It's getting the value it needs to update from another table via a join table. Like this...

UPDATE listings
SET master_ext_id = c.id
FROM listings a
    JOIN listing_to_external_id b on a.id = b.listing_id
    JOIN external_ids c on b.external_id = c.id
AND a.master_ext_id is null
AND c.provider_id = 0

Update on listings  (cost=731559.58..133068628689.23 rows=10645372213165 width=637)
  ->  Nested Loop  (cost=731559.58..133068628689.23 rows=10645372213165 width=637)
        ->  Seq Scan on listings  (cost=0.00..397447.29 rows=14832429 width=611)
        ->  Materialize  (cost=731559.58..1135721.70 rows=717709 width=26)
              ->  Hash Join  (cost=731559.58..1132133.16 rows=717709 width=26)
                    Hash Cond: (b.listing_id = a.id)
                    ->  Hash Join  (cost=148706.93..526852.10 rows=717709 width=28)
                          Hash Cond: (b.external_id = c.id)
                          ->  Seq Scan on listing_to_external_id b  (cost=0.00..236589.51 rows=15357551 width=22)
                          ->  Hash  (cost=139735.49..139735.49 rows=717715 width=14)
                                ->  Index Scan using ei_provider_id on external_ids c  (cost=0.00..139735.49 rows=717715 width=14)
                                      Index Cond: (provider_id = 0)
                    ->  Hash  (cost=397447.29..397447.29 rows=14832429 width=14)
                          ->  Seq Scan on listings a  (cost=0.00..397447.29 rows=14832429 width=14)
                                Filter: (master_ext_id IS NULL)

Obviously, looking at the execution plan, you can see that this query is taking an extremely long time. I'm assuming at this point that it has to do with the number of rows that are involved in the query, but I need a way to speed this up somehow.

In addition to the ~14 million records in the listings table, there are ~15 million rows in the listing_to_external_id table and ~15 million in the external_ids table.

I've tried setting enable_seqscan to off, and it uses the indexes I've created, so I know it's just a case of the planner determining that a seq scan would be faster. I've also ANALYZE'd my tables.
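For reference, the planner experiment described above looks something like this in a psql session (table names are from the question; the exact plans produced will differ per database, and `enable_seqscan = off` is a diagnostic, not a fix):

```sql
-- Force the planner away from sequential scans for this session only,
-- then inspect the plan to confirm the indexes would be used.
SET enable_seqscan = off;
EXPLAIN
UPDATE listings SET master_ext_id = c.id FROM /* ...rest of the query... */;

-- Restore the default and refresh planner statistics on the tables involved.
RESET enable_seqscan;
ANALYZE listings;
ANALYZE listing_to_external_id;
ANALYZE external_ids;
```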

I've tried limiting the rows updated by using the primary key on the listings table, hoping I might be able to loop through and update the rows a handful at a time. As you can see, this had little effect...

UPDATE listings
SET master_ext_id = c.id
FROM listings a
    JOIN listing_to_external_id b on a.id = b.listing_id
    JOIN external_ids c on b.external_id = c.id
WHERE a.id >= 34649050
AND a.id <= 35649050
AND a.master_ext_id is null
AND c.provider_id = 0

Update on listings  (cost=212130.40..9379727588.60 rows=750294018398 width=637)
  ->  Nested Loop  (cost=212130.40..9379727588.60 rows=750294018398 width=637)
        ->  Seq Scan on listings  (cost=0.00..397447.29 rows=14832429 width=611)
        ->  Materialize  (cost=212130.40..600005.71 rows=50585 width=26)
              ->  Hash Join  (cost=212130.40..599752.78 rows=50585 width=26)
                    Hash Cond: (b.listing_id = a.id)
                    ->  Hash Join  (cost=148706.93..526852.10 rows=717709 width=28)
                          Hash Cond: (b.external_id = c.id)
                          ->  Seq Scan on listing_to_external_id b  (cost=0.00..236589.51 rows=15357551 width=22)
                          ->  Hash  (cost=139735.49..139735.49 rows=717715 width=14)
                                ->  Index Scan using ei_provider_id on external_ids c  (cost=0.00..139735.49 rows=717715 width=14)
                                      Index Cond: (provider_id = 0)
                    ->  Hash  (cost=50355.96..50355.96 rows=1045401 width=14)
                          ->  Index Scan using listings_pkey on listings a  (cost=0.00..50355.96 rows=1045401 width=14)
                                Index Cond: ((id >= 34649050) AND (id <= 35649050))
                                Filter: (master_ext_id IS NULL)

I've tried tuning the settings on Postgres to better handle such a large query, but this seemed to have little effect as well. I can get into these settings if nothing can be done with the query itself.

I've also tried taking the result of the join between listing_to_external_id and external_ids and putting it into a table, indexing it, and then joining listings on that table. This resulted in a very similar execution plan/cost.

Not sure what else to do at this point. Just let the query run over the weekend and it's still running. Any suggestions?

Answer

You used the listings table twice: once as the UPDATE target and again in the FROM clause. Look at the first execution plan: it joins listings against itself with no join condition between the two, producing a cartesian product (CROSS JOIN). You only need listings once, as the UPDATE target.

Try something like this:

UPDATE listings a
SET master_ext_id = c.id
FROM listing_to_external_id b
JOIN external_ids c on b.external_id = c.id
WHERE a.id = b.listing_id
 AND a.master_ext_id is null
 AND c.provider_id = 0
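If a single statement over ~14 million rows is still too heavy (one long transaction, lock time, WAL volume), the id-range batching attempted in the question can be combined with this corrected form. A sketch of one batch, meant to be run repeatedly from the client so each window commits on its own (the id bounds below reuse the question's values; the window size is an arbitrary choice):

```sql
-- One batch: advance the id window each iteration from the client,
-- so every batch commits separately and locks/WAL stay bounded.
UPDATE listings a
SET master_ext_id = c.id
FROM listing_to_external_id b
JOIN external_ids c ON b.external_id = c.id
WHERE a.id = b.listing_id
  AND a.id >= 34649050 AND a.id < 34749050   -- current window
  AND a.master_ext_id IS NULL
  AND c.provider_id = 0;
```

Because listings is no longer self-joined, the id range now actually restricts the UPDATE target rather than only the inner hash, which is why this helps here where it had little effect on the original query.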
