How do I do large non-blocking updates in PostgreSQL?


Question


I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way in the psql console to make these types of operations faster.

For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:

UPDATE orders SET status = null;

To avoid being diverted to an offtopic discussion, let's assume that all the status values for the 35 million rows are currently set to the same (non-null) value, thus rendering an index useless.

The problem with this statement is that it takes a very long time to go into effect (solely because of the locking), and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like

UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000);

might take 1 minute. Over 35 million rows, doing the above and breaking it into 35 chunks would only take 35 minutes and save me 4 hours and 25 minutes.

I could break it down even further with a script (using pseudocode here):

for (i = 0 to 3500) {
  db_operation ("UPDATE orders SET status = null" +
                " WHERE order_id >= " + (i*10000) +
                " AND order_id < " + ((i+1)*10000));
}

This operation might complete in only a few minutes, rather than 35.

So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?

Solution

Column / Row

... I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update.

Any UPDATE in PostgreSQL's MVCC model writes a new version of the whole row. If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing the same column won't be touched by concurrent transactions avoids some possible complications, but not others.
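
You can see the row rewrite in action by inspecting the system columns ctid and xmin; a small illustrative check against the orders table from the question:

SELECT ctid, xmin, status FROM orders WHERE order_id = 1;
UPDATE orders SET status = null WHERE order_id = 1;
SELECT ctid, xmin, status FROM orders WHERE order_id = 1;  -- new ctid and xmin: a new row version was written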

Index

To avoid being diverted to an offtopic discussion, let's assume that all the status values for the 35 million rows are currently set to the same (non-null) value, thus rendering an index useless.

When updating the whole table (or major parts of it) Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: Index maintenance means additional cost for the UPDATE.
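
You can confirm the plan without executing the statement:

EXPLAIN UPDATE orders SET status = null;
-- expect a Seq Scan on orders; an index on status would not be used here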

Performance

For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:

UPDATE orders SET status = null;

I understand you are aiming for a more general solution (see below). But to address the actual question asked: this can be dealt with in a matter of milliseconds, regardless of table size:

ALTER TABLE orders DROP column status
                 , ADD  column status text;

The manual (up to Postgres 10):

When a column is added with ADD COLUMN, all existing rows in the table are initialized with the column's default value (NULL if no DEFAULT clause is specified). If there is no DEFAULT clause, this is merely a metadata change [...]

The manual (since Postgres 11):

When a column is added with ADD COLUMN and a non-volatile DEFAULT is specified, the default is evaluated at the time of the statement and the result stored in the table's metadata. That value will be used for the column for all existing rows. If no DEFAULT is specified, NULL is used. In neither case is a rewrite of the table required.

Adding a column with a volatile DEFAULT or changing the type of an existing column will require the entire table and its indexes to be rewritten. [...]

And:

The DROP COLUMN form does not physically remove the column, but simply makes it invisible to SQL operations. Subsequent insert and update operations in the table will store a null value for the column. Thus, dropping a column is quick but it will not immediately reduce the on-disk size of your table, as the space occupied by the dropped column is not reclaimed. The space will be reclaimed over time as existing rows are updated.
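
Taken together, the quotes above mean that on Postgres 11 or later the same trick even works when a (non-volatile) default should apply to all existing rows right away; the value 'new' below is just a placeholder:

ALTER TABLE orders DROP column status
                 , ADD  column status text DEFAULT 'new';  -- still no table rewrite in Postgres 11+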

Make sure you don't have objects depending on the column (foreign key constraints, indices, views, ...). You would need to drop / recreate those. Barring that, tiny operations on the system catalog table pg_attribute do the job. It requires an exclusive lock on the table, which may be a problem under heavy concurrent load. (Like Buurman emphasizes in his comment.) Barring that, the operation is a matter of milliseconds.

If you have a column default you want to keep, add it back in a separate command. Doing it in the same command applies it to all rows immediately. See:
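
A sketch of that, with a hypothetical default value:

ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'new';  -- affects future rows only; existing rows stay NULL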

To actually apply the default, consider doing it in batches:
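
For example (value and ranges are placeholders):

UPDATE orders
SET    status = 'new'
WHERE  order_id >= 1 AND order_id < 1000000   -- repeat for the next range, and so on
AND    status IS DISTINCT FROM 'new';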

General solution

dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and can't be rolled back.

This allows a single function to update a big table in smaller parts, with each part committed separately. It avoids building up transaction overhead for very big numbers of rows and, more importantly, releases locks after each part. This lets concurrent operations proceed without much delay and makes deadlocks less likely.

If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK after an exception. Also consider SAVEPOINT for that case.
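
A minimal sketch of the SAVEPOINT variant: still a single transaction, but a failure in a later chunk no longer forces a ROLLBACK of the earlier ones:

BEGIN;
UPDATE orders SET status = null WHERE order_id <  1000000;
SAVEPOINT chunk_1;
UPDATE orders SET status = null WHERE order_id >= 1000000 AND order_id < 2000000;
-- on error: ROLLBACK TO SAVEPOINT chunk_1;  (work before the savepoint is kept)
COMMIT;  -- note: all locks are held until COMMIT, unlike with dblink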

Disclaimer

First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.

If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row to a partition that's supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.

Step-by-step instructions

The additional module dblink needs to be installed first:
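
In any reasonably modern version:

CREATE EXTENSION IF NOT EXISTS dblink;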

Setting up the connection with dblink very much depends on the setup of your DB cluster and the security policies in place. It can be tricky. A related, later answer explains more about how to connect with dblink:

Create a FOREIGN SERVER and a USER MAPPING as instructed there to simplify and streamline the connection (unless you have one already).
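
A minimal sketch of such a setup; the server name matches the function below, while host, database, and credentials are placeholders:

CREATE SERVER myserver FOREIGN DATA WRAPPER dblink_fdw
   OPTIONS (host 'localhost', dbname 'mydb');
CREATE USER MAPPING FOR CURRENT_USER SERVER myserver
   OPTIONS (user 'postgres', password 'secret');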
Assuming a serial PRIMARY KEY with or without some gaps.

CREATE OR REPLACE FUNCTION f_update_in_steps()
  RETURNS void AS
$func$
DECLARE
   _step int;   -- size of step
   _cur  int;   -- current ID (starting with minimum)
   _max  int;   -- maximum ID
BEGIN
   SELECT INTO _cur, _max  min(order_id), max(order_id) FROM orders;
                                        -- 100 slices (steps) hard coded
   _step := ((_max - _cur) / 100) + 1;  -- rounded, possibly a bit too small
                                        -- +1 to avoid endless loop for 0
   PERFORM dblink_connect('myserver');  -- your foreign server as instructed above

   FOR i IN 0..200 LOOP                 -- 200 >> 100 to make sure we exceed _max
      PERFORM dblink_exec(
       $$UPDATE public.orders
         SET    status = 'foo'
         WHERE  order_id >= $$ || _cur || $$
         AND    order_id <  $$ || _cur + _step || $$
         AND    status IS DISTINCT FROM 'foo'$$);  -- avoid empty update

      _cur := _cur + _step;

      EXIT WHEN _cur > _max;            -- stop when done (never loop till 200)
   END LOOP;

   PERFORM dblink_disconnect();
END
$func$  LANGUAGE plpgsql;

Call:

SELECT f_update_in_steps();

You can parameterize any part according to your needs: the table name, column name, value, ... just be sure to sanitize identifiers to avoid SQL injection:
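
In PL/pgSQL, format() handles that: %I quotes identifiers and %L quotes literals safely. A hypothetical dynamic version of the statement used above:

SELECT format('UPDATE %I.%I SET %I = %L WHERE %I >= %s AND %I < %s',
              'public', 'orders', 'status', 'foo',
              'order_id', 0, 'order_id', 10000);
-- pass the resulting string to dblink_exec()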

Avoid empty UPDATEs:
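
That is the pattern already used in the function above, shown in isolation:

UPDATE orders
SET    status = 'foo'
WHERE  status IS DISTINCT FROM 'foo';  -- rows that already hold the target value are skipped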
