Doing a big update using multiple tables in PostgreSQL


Question

A few weeks ago I started working with Postgres, and for the last few days I have been trying to solve this problem with mixed results: I have a table (10 million rows) that needs to be updated once every month with information stored in a few other tables. During this update no one has access to the database, so no one is reading or writing data. I will be the only DB user during that time.

Table A has 10 million rows. One column (rebuilding costs) needs to be updated. Table B, table C and table D hold newly calculated rebuilding costs for every row in table A (so tables B, C and D each have 10 million rows). Which table is used for updating table A depends on the value, see below. I've been trying to do this update with this query:

UPDATE table_A a
SET rebuilding_costs = 
    CASE
    WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs 
    WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs 
    WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs 
    ELSE NULL
    END
FROM table_B b
LEFT OUTER JOIN table_C c
    ON (b.address = c.address)
LEFT OUTER JOIN table_D d
    ON (b.address = d.address)
WHERE a.address = b.address
;

This query gives the right result, but it is a little slow (25 minutes). The funny thing is: when table A has 1 million rows (instead of 10 million) it only takes 30 seconds. So when running the query on a table with 10 million rows I expected it to take about 5 minutes, but it took 25 minutes instead. That's when I tried to do the UPDATE in blocks, so I added this line to the WHERE clause of the query:

AND (a.id > 0 AND a.id < 1000000)

a.id is the primary key of table A. In this example only the first million rows of table A get updated. However, it took 3 minutes to complete. You would have to do it 10 times to update all 10 million rows, so that would be 30 minutes.
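
For reference, the block-wise approach can be automated in a single PL/pgSQL DO block. This is only a sketch under the assumption (mine, not stated in the question) that a.id is a dense integer primary key; the block size is arbitrary:

DO $$
DECLARE
    lower_bound bigint := 0;
    step        bigint := 1000000;   -- rows per block, pick what works for you
    max_id      bigint;
BEGIN
    SELECT max(id) INTO max_id FROM table_A;
    WHILE lower_bound <= max_id LOOP
        UPDATE table_A a
        SET rebuilding_costs =
            CASE
            WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs
            WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs
            WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs
            ELSE NULL
            END
        FROM table_B b
        LEFT OUTER JOIN table_C c ON (b.address = c.address)
        LEFT OUTER JOIN table_D d ON (b.address = d.address)
        WHERE a.address = b.address
          AND a.id >= lower_bound
          AND a.id <  lower_bound + step;
        lower_bound := lower_bound + step;
    END LOOP;
END $$;

Note that the whole DO block still runs as one transaction, so this only splits the work; it does not commit per block.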

I also tried to prevent the query from doing work on rows that are not going to be changed by the UPDATE, by adding this:

AND a.herbouwwaarde_indicatie IS DISTINCT FROM b.inhoud
AND a.herbouwwaarde_indicatie IS DISTINCT FROM c.inhoud
AND a.herbouwwaarde_indicatie IS DISTINCT FROM d.inhoud

Setting the fillfactor to 70 and to 50 did help to improve the speed a little bit, but I couldn't get it under 20 minutes.
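
For reference, this is roughly how the fillfactor is changed (a sketch, not from the original question). On an existing table the new fillfactor only applies to pages written from then on, so the table has to be rewritten (for example with VACUUM FULL) before existing rows benefit from it:

ALTER TABLE table_A SET (fillfactor = 70);
VACUUM FULL table_A;   -- rewrites the table so existing pages respect the new fillfactor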

I also tried to recreate table A, like this:

CREATE TABLE table_A_new
AS
SELECT a.address,
    CASE
    WHEN b.rebuilding_costs BETWEEN 100000 AND 200000 THEN b.rebuilding_costs
    WHEN c.rebuilding_costs BETWEEN 200001 AND 400000 THEN c.rebuilding_costs
    WHEN d.rebuilding_costs BETWEEN 400001 AND 600000 THEN d.rebuilding_costs
    ELSE NULL
    END rebuildingcosts
FROM table_A a
LEFT OUTER JOIN table_B b
    ON (a.address = b.address)
LEFT OUTER JOIN table_C c
    ON (a.address = c.address)
LEFT OUTER JOIN table_D d
    ON (a.address = d.address)
;

This is very fast (2 minutes) and gives the right result. However, you have to recreate an entire table. Somehow that doesn't seem very efficient (and it takes a lot of storage space), which is why I started out with UPDATE queries.
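
If the CREATE TABLE AS route turns out to be the fastest, the usual follow-up is to swap the new table into place; a rough sketch (assuming table_A_new carries every column you still need, and with hypothetical index/constraint names):

BEGIN;
ALTER TABLE table_A RENAME TO table_A_old;
ALTER TABLE table_A_new RENAME TO table_A;
-- recreate whatever indexes and constraints the old table had, for example:
-- ALTER TABLE table_A ADD PRIMARY KEY (id);
-- CREATE INDEX table_a_address_idx ON table_A (address);
COMMIT;
DROP TABLE table_A_old;   -- reclaim the space once everything checks out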

My question is: what is the best way to proceed? Is there a way to improve the performance of the UPDATE query, or would it be better to look for an alternative such as the 'create table' approach in the example above?

I have to use Postgres though; I can't switch to a different DBMS.

This is the execution plan of the UPDATE query (right now the tables don't have 10 million rows, but approximately 6-8 million rows):

Update on tabel_A a  (cost=902288.27..2150690.80 rows=6714762 width=65)
  ->  Hash Join  (cost=902288.27..2150690.80 rows=6714762 width=65)
        Hash Cond: ((b.adres)::text = a.adres)
        Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)
                  AND (c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)
                  AND (d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
        ->  Hash Left Join  (cost=522527.27..1318059.42 rows=6716471 width=39)
              Hash Cond: ((b.adres)::text = (d.adres)::text)
              ->  Hash Right Join  (cost=295916.60..817658.93 rows=6716471 width=29)
                    Hash Cond: ((c.adres)::text = (b.adres)::text)
                    ->  Seq Scan on Tabel_C c  (cost=0.00..240642.35 rows=7600735 width=19)
                    ->  Hash  (cost=172605.71..172605.71 rows=6716471 width=19)
                          ->  Seq Scan on tabel_B b  (cost=0.00..172605.71 rows=6716471 width=19)
              ->  Hash  (cost=103436.52..103436.52 rows=6709052 width=19)
                    ->  Seq Scan on tabel_D d  (cost=0.00..103436.52 rows=6709052 width=19)
        ->  Hash  (cost=217261.00..217261.00 rows=8000000 width=39)
              ->  Seq Scan on Tabel_A a  (cost=0.00..217261.00 rows=8000000 width=39)

This is the EXPLAIN ANALYZE output:

Update on Tabel_A a  (cost=902288.27..2150690.80 rows=6714762 width=65) (actual time=2815452.997..2815452.997 rows=0 loops=1)
  ->  Hash Join  (cost=902288.27..2150690.80 rows=6714762 width=65) (actual time=108861.999..214888.780 rows=5252864 loops=1)
        Hash Cond: ((b.adres)::text = a.adres)
        Join Filter: ((b.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)
                  AND (c.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs)
                  AND (d.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs))
        Rows Removed by Join Filter: 670998
        ->  Hash Left Join  (cost=522527.27..1318059.42 rows=6716471 width=39) (actual time=43138.635..116933.803 rows=6711432 loops=1)
              Hash Cond: ((b.adres)::text = (d.adres)::text)
              ->  Hash Right Join  (cost=295916.60..817658.93 rows=6716471 width=29) (actual time=34571.750..99040.256 rows=6710550 loops=1)
                    Hash Cond: ((c.adres)::text = (b.adres)::text)
                    ->  Seq Scan on Tabel_C c  (cost=0.00..240642.35 rows=7600735 width=19) (actual time=127.080..59703.935 rows=7595083 loops=1)
                    ->  Hash  (cost=172605.71..172605.71 rows=6716471 width=19) (actual time=29925.787..29925.787 rows=6709229 loops=1)
                          Buckets: 2048  Batches: 512  Memory Usage: 678kB
                          ->  Seq Scan on Tabel_B b  (cost=0.00..172605.71 rows=6716471 width=19) (actual time=0.017..27245.069 rows=6709229 loops=1)
              ->  Hash  (cost=103436.52..103436.52 rows=6709052 width=19) (actual time=8566.848..8566.848 rows=6709229 loops=1)
                    Buckets: 2048  Batches: 512  Memory Usage: 678kB
                    ->  Seq Scan on Tabel_D d  (cost=0.00..103436.52 rows=6709052 width=19) (actual time=0.009..5970.010 rows=6709229 loops=1)
        ->  Hash  (cost=217261.00..217261.00 rows=8000000 width=39) (actual time=65721.815..65721.815 rows=8000000 loops=1)
              Buckets: 2048  Batches: 1024  Memory Usage: 612kB
              ->  Seq Scan on Tabel_A a  (cost=0.00..217261.00 rows=8000000 width=39) (actual time=0.056..55968.171 rows=8000000 loops=1)
Total runtime: 2815453.549 ms

Tables A, B, C and D all have an index on the address column, the column that is being used to join on. The tables have no other indexes. Table A does have a primary key (id). Tables B, C and D each have 5 to 7 extra columns that are not used in this process.
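
For reference, the index setup described above corresponds roughly to this (index names are hypothetical):

CREATE INDEX table_a_address_idx ON table_A (address);
CREATE INDEX table_b_address_idx ON table_B (address);
CREATE INDEX table_c_address_idx ON table_C (address);
CREATE INDEX table_d_address_idx ON table_D (address);
-- table_A additionally has its primary key on id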

Answer

This query should produce the same results, but avoids the CASE:

-- EXPLAIN ANALYZE
UPDATE table_a a
SET rebuilding_costs = drie.rebuilding_costs
FROM (
        SELECT COALESCE(b.address, c.address, d.address) AS address
        , COALESCE(b.rebuilding_costs, c.rebuilding_costs,  d.rebuilding_costs)
            AS rebuilding_costs
        FROM table_b b
        FULL OUTER JOIN table_c c ON c.address = b.address
                AND  c.rebuilding_costs BETWEEN 200001 AND 400000
        FULL OUTER JOIN table_D d ON d.address = b.address
                AND  d.rebuilding_costs BETWEEN 400001 AND 600000
        WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
        ) drie
WHERE a.address = drie.address
AND a.rebuilding_costs <> drie.rebuilding_costs -- Avoid useless updates
        ;
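
One way to sanity-check this rewrite before touching any data (my suggestion, not part of the original answer) is to run the derived table on its own and verify that it yields at most one row per address, since duplicate matches would make the UPDATE ... FROM pick an arbitrary row:

SELECT address, count(*)
FROM (
        SELECT COALESCE(b.address, c.address, d.address) AS address
        FROM table_b b
        FULL OUTER JOIN table_c c ON c.address = b.address
                AND c.rebuilding_costs BETWEEN 200001 AND 400000
        FULL OUTER JOIN table_D d ON d.address = b.address
                AND d.rebuilding_costs BETWEEN 400001 AND 600000
        WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
        ) drie
GROUP BY address
HAVING count(*) > 1;   -- should return no rows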

UPDATE: A similar approach, based on a chained set of CTEs:

-- --------------------------------
EXPLAIN ANALYZE
WITH cte_b AS (
        SELECT b.address,  b.rebuilding_costs
        FROM table_b b
        WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
        )
,       cte_c AS (
        SELECT c.address , c.rebuilding_costs
        FROM table_c c
        WHERE  c.rebuilding_costs BETWEEN 200001 AND 400000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = c.address)
        )
,       cte_d AS (
        SELECT d.address , d.rebuilding_costs
        FROM table_d d
        WHERE  d.rebuilding_costs BETWEEN 400001 AND 600000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = d.address)
        AND NOT EXISTS (SELECT * FROM cte_c WHERE cte_c.address = d.address)
        )
,       cte_bcd AS (
        SELECT           cte_b.address,  cte_b.rebuilding_costs FROM cte_b
        UNION ALL SELECT cte_c.address,  cte_c.rebuilding_costs FROM cte_c
        UNION ALL SELECT cte_d.address,  cte_d.rebuilding_costs FROM cte_d
        )
UPDATE table_a a
SET rebuilding_costs = cte_bcd.rebuilding_costs
FROM cte_bcd
WHERE a.address = cte_bcd.address
-- avoid useless updates this way:
AND a.rebuilding_costs <> cte_bcd.rebuilding_costs
-- ,or this way:
-- AND cte_bcd.rebuilding_costs IS DISTINCT FROM a.rebuilding_costs
   ;

UPDATE2: CTEs can be slow because they act as optimisation barriers. A quick-and-dirty workaround is to rewrite them as (temporary) VIEWs and refer to those instead. This allows the optimiser to shuffle the various parts of the query into and out of subqueries, and even to combine and reuse them.

CREATE TEMP VIEW cte_b AS (
        SELECT b.address,  b.rebuilding_costs
        FROM table_b b
        WHERE b.rebuilding_costs BETWEEN 100001 AND 200000
        );
CREATE TEMP VIEW        cte_c AS (
        SELECT c.address , c.rebuilding_costs
        FROM table_c c
        WHERE  c.rebuilding_costs BETWEEN 200001 AND 400000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = c.address)
        );
CREATE TEMP VIEW        cte_d AS (
        SELECT d.address , d.rebuilding_costs
        FROM table_d d
        WHERE  d.rebuilding_costs BETWEEN 400001 AND 600000
        AND NOT EXISTS (SELECT * FROM cte_b WHERE cte_b.address = d.address)
        AND NOT EXISTS (SELECT * FROM cte_c WHERE cte_c.address = d.address)
        );
CREATE TEMP VIEW        cte_bcd AS (
        SELECT           cte_b.address,  cte_b.rebuilding_costs FROM cte_b
        UNION ALL SELECT cte_c.address,  cte_c.rebuilding_costs FROM cte_c
        UNION ALL SELECT cte_d.address,  cte_d.rebuilding_costs FROM cte_d
        );
EXPLAIN -- ANALYZE
UPDATE table_a a
SET rebuilding_costs = cte_bcd.rebuilding_costs
FROM cte_bcd
WHERE a.address = cte_bcd.address
AND a.rebuilding_costs <> cte_bcd.rebuilding_costs -- avoid useless updates
-- AND a.address < 100000
        ;
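
A small usage note (mine, not from the original answer): the temporary views disappear automatically when the session ends, but they can also be dropped explicitly once the update is done, in reverse dependency order:

DROP VIEW cte_bcd;
DROP VIEW cte_d;
DROP VIEW cte_c;
DROP VIEW cte_b;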
