PostgreSQL - How to speed up updating a huge table (100 million rows)?


Question

I have two large tables:

Table "public.tx_input1_new" (100,000,000 rows) 

     Column     |            Type             | Modifiers
----------------|-----------------------------|----------
 blk_hash       | character varying(500)      |
 blk_time       | timestamp without time zone |
 tx_hash        | character varying(500)      |
 input_tx_hash  | character varying(100)      |
 input_tx_index | smallint                    |
 input_addr     | character varying(500)      |
 input_val      | numeric                     |

Indexes:
    "tx_input1_new_h" btree (input_tx_hash, input_tx_index) 


Table "public.tx_output1_new" (100,000,000 rows)

    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 tx_hash      | character varying(100) |
 output_addr  | character varying(500) |
 output_index | smallint               |
 input_val    | numeric                |

Indexes:
    "tx_output1_new_h" btree (tx_hash, output_index)

I want to update table1 using the other table:

UPDATE tx_input1 as i
SET 
  input_addr = o.output_addr,
  input_val = o.output_val
FROM tx_output1 as o
WHERE 
  i.input_tx_hash = o.tx_hash
  AND i.input_tx_index = o.output_index;

Before I execute this SQL command, I already created indexes on these two tables:

CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);

CREATE INDEX tx_output1_new_h ON tx_output1_new (tx_hash, output_index);

I used the EXPLAIN command to look at the query plan, but it didn't use the indexes I created.
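This is roughly what that looks like; plain EXPLAIN (without ANALYZE) only plans the statement and never executes it, so it is safe to run even on a query this expensive:

-- Plain EXPLAIN shows the chosen plan without running the 14-hour UPDATE.
EXPLAIN
UPDATE tx_input1 AS i
SET input_addr = o.output_addr,
    input_val  = o.output_val
FROM tx_output1 AS o
WHERE i.input_tx_hash = o.tx_hash
  AND i.input_tx_index = o.output_index;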

It took about 14-15 hours to complete this UPDATE.

What is the problem here?

How can I shorten the execution time, or tune my database/tables?

Thanks.

Answer

Since you are joining two large tables and there are no conditions that could filter out rows, the only efficient join strategy will be a hash join, and no index can help with that.

First there will be a sequential scan of one of the tables, from which a hash structure is built, then there will be a sequential scan over the other table, and the hash will be probed for each row found. How could any index help with that?

You can expect such an operation to take a long time, but there are some ways to speed it up:

Remove all indexes and constraints on tx_input1 before you begin. Your query is one of the examples where an index does not help at all, but actually hurts performance, because the indexes have to be updated along with the table. Recreate the indexes and constraints after you are done with the UPDATE. Depending on the number of indexes on the table, you can expect a decent to massive performance gain.
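A minimal sketch of that workflow, using the index name from the question (any additional indexes or constraints would be handled the same way):

-- Drop the index so it does not have to be maintained for every updated row.
DROP INDEX IF EXISTS tx_input1_new_h;

-- ... run the big UPDATE here ...

-- Recreate the index afterwards; one bulk index build is much cheaper
-- than 100 million incremental index updates.
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);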

Increase the work_mem parameter for this one operation with the SET command, as high as you can. The more memory the hash operation can use, the faster it will be. With a table that big you'll probably still end up with temporary files, but you can still expect a decent performance gain.
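A sketch of that; the actual value is an assumption and has to be sized to the RAM the machine can spare:

-- Raise work_mem for the current session only; it reverts when the session ends.
SET work_mem = '2GB';  -- example value, not a recommendation

-- ... run the big UPDATE in this same session ...

RESET work_mem;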

Increase checkpoint_segments (or max_wal_size from version 9.6 on) to a high value so that there are fewer checkpoints during the UPDATE operation.
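On 9.6 or later this could be done with ALTER SYSTEM, which only needs a configuration reload, not a restart; the value shown is only an illustration:

-- Allow much more WAL to accumulate between checkpoints during the bulk UPDATE.
ALTER SYSTEM SET max_wal_size = '50GB';  -- example value
SELECT pg_reload_conf();

-- ... run the UPDATE ...

-- Restore the previous setting afterwards.
ALTER SYSTEM RESET max_wal_size;
SELECT pg_reload_conf();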

Make sure that the table statistics on both tables are accurate, so that PostgreSQL can come up with a good estimate for the number of hash buckets to create.
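The usual way to refresh those statistics is a manual ANALYZE of both tables right before the UPDATE:

-- Refresh planner statistics so row counts and hash bucket estimates are reasonable.
ANALYZE tx_input1_new;
ANALYZE tx_output1_new;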

After the UPDATE, if it affects a large number of rows, you might consider running VACUUM (FULL) on tx_input1 to get rid of the resulting table bloat. This will lock the table for a longer time, so do it during a maintenance window. It will reduce the size of the table and, as a consequence, speed up sequential scans.
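With the table name from the query, that last step is simply:

-- Rewrites the whole table to reclaim space left behind by the old row versions.
-- Takes an ACCESS EXCLUSIVE lock, so run it in a maintenance window.
VACUUM (FULL, ANALYZE) tx_input1;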

