How does COPY work and why is it so much faster than INSERT?


Question

Today I spent my day improving the performance of my Python script which pushes data into my Postgres database. I was previously inserting records as such:

query = "INSERT INTO my_table (a,b,c ... ) VALUES (%s, %s, %s ...)";
for d in data:
    cursor.execute(query, d)

I then re-wrote my script so that it creates an in-memory file that is then used for Postgres' COPY command, which lets me copy data from a file to my table:

from io import StringIO

# Stream the whole in-memory TSV through a single COPY statement.
f = StringIO(my_tsv_string)
cursor.copy_expert("COPY my_table FROM STDIN WITH CSV DELIMITER AS E'\t' ENCODING 'utf-8' QUOTE E'\b' NULL ''", f)

The COPY method was significantly faster:

METHOD      | TIME (secs) | # RECORDS
=====================================
COPY_FROM   |      92.998 |     48339
INSERT      |    1011.931 |     48377

But I cannot find any information as to why. How does it work differently from a multi-line INSERT such that it is so much faster?

See this benchmark:

# original
0.008857011795043945: query_builder_insert
0.0029380321502685547: copy_from_insert

# 10 records
0.00867605209350586: query_builder_insert
0.003248929977416992: copy_from_insert

# 10k records
0.041108131408691406: query_builder_insert
0.010066032409667969: copy_from_insert

# 1M records
3.464181900024414: query_builder_insert
0.47070908546447754: copy_from_insert

# 10M records
38.96936798095703: query_builder_insert
5.955034017562866: copy_from_insert


Answer

There are a number of factors at work here:

  • Network latency and round-trip delays
  • Per-statement overheads in PostgreSQL
  • Context switches and scheduler delays
  • COMMIT costs for people doing one commit per insert (you aren't)
  • COPY-specific optimisations for bulk loading

If the server is remote, you might be "paying" a per-statement fixed time "price" of, say, 50ms (1/20th of a second), or much more for some cloud-hosted DBs. Since the next insert cannot begin until the last one completes successfully, your maximum rate of inserts is 1000/round-trip-latency-in-ms rows per second. At a latency of 50ms ("ping time"), that's 20 rows/second. Even on a local server, this delay is nonzero. Whereas COPY just fills the TCP send and receive windows and streams rows as fast as the DB can write them and the network can transfer them. It isn't affected much by latency, and might be inserting thousands of rows per second over the same network link.
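A common middle ground here is batching: psycopg2's execute_values helper packs many rows into a single multi-row INSERT, so you pay the round-trip cost once per batch instead of once per row. A minimal sketch, reusing the hypothetical my_table columns from the question:

from psycopg2.extras import execute_values

# One network round trip per page of 1000 rows instead of one per row.
execute_values(
    cursor,
    "INSERT INTO my_table (a, b, c) VALUES %s",
    data,             # an iterable of row tuples
    page_size=1000,
)

This still won't match COPY, but it removes most of the per-row latency cost.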

There are also costs to parsing, planning and executing a statement in PostgreSQL. It must take locks, open relation files, look up indexes, and so on. COPY tries to do all of this once, at the start, then focuses purely on loading rows as fast as possible.
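You can amortise the parse/plan part yourself with a server-side prepared statement. A hedged sketch, assuming columns a, b and c are text (the statement name bulk_ins is made up for illustration); note this still pays one round trip per row:

# Parse and plan the INSERT once on the server ...
cursor.execute(
    "PREPARE bulk_ins (text, text, text) AS "
    "INSERT INTO my_table (a, b, c) VALUES ($1, $2, $3)")

# ... then each EXECUTE only binds parameters and runs the stored plan.
for d in data:
    cursor.execute("EXECUTE bulk_ins (%s, %s, %s)", d)

cursor.execute("DEALLOCATE bulk_ins")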

There are further time costs due to the operating system having to switch between postgres, which waits for a row while your app prepares and sends it, and your app, which waits for postgres's response while postgres processes the row. Every time you switch from one to the other, you waste a little time. More time is potentially wasted suspending and resuming various low-level kernel state as processes enter and leave wait states.

On top of all that, COPY has some optimisations it can use for some kinds of loads. If there is no generated key and any default values are constants, for example, it can pre-calculate them and bypass the executor completely, fast-loading data into the table at a lower level that skips part of PostgreSQL's normal work entirely. If you CREATE TABLE or TRUNCATE in the same transaction as the COPY, it can do even more tricks to make the load faster by bypassing the normal transaction book-keeping needed in a multi-client database.
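To benefit from that last optimisation, keep the TRUNCATE (or CREATE TABLE) and the COPY inside one transaction. A sketch of the pattern, assuming a psycopg2 connection object conn with autocommit left off (the shortcut also depends on server settings such as wal_level):

from io import StringIO

f = StringIO(my_tsv_string)

# TRUNCATE and COPY in the same transaction: the new contents are invisible
# to other sessions until COMMIT, so PostgreSQL can skip some of its normal
# transaction book-keeping for the load.
cursor.execute("TRUNCATE my_table")
cursor.copy_expert("COPY my_table FROM STDIN WITH CSV DELIMITER AS E'\t'", f)
conn.commit()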

Despite this, PostgreSQL's COPY could still do a lot more to speed things up, things it doesn't yet know how to do. It could automatically skip index updates, then rebuild the indexes, if you're changing more than a certain proportion of the table. It could do index updates in batches. Lots more.
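You can apply that index trick manually today. A hedged sketch, assuming a single index named my_table_a_idx on column a that nothing else needs while the load runs:

# Drop the index so COPY never has to maintain it row by row ...
cursor.execute("DROP INDEX IF EXISTS my_table_a_idx")

f = StringIO(my_tsv_string)
cursor.copy_expert("COPY my_table FROM STDIN WITH CSV DELIMITER AS E'\t'", f)

# ... then rebuild it in one bulk operation afterwards, which is usually
# much faster than incremental per-row updates for large loads.
cursor.execute("CREATE INDEX my_table_a_idx ON my_table (a)")
conn.commit()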

One final thing to consider is commit costs. It's probably not a problem for you, because psycopg2 defaults to opening a transaction and not committing until you tell it to (unless you told it to use autocommit). But for many DB drivers autocommit is the default. In such cases you'd be doing one commit for every INSERT. That means one disk flush, where the server makes sure it writes all data in memory out to disk and tells the disks to write their own caches out to persistent storage. This can take a long time, and varies a lot based on the hardware. My SSD-based NVMe BTRFS laptop can do only 200 fsyncs/second, vs 300,000 non-synced writes/second. So it'll only load 200 rows/second! Some servers can only do 50 fsyncs/second. Some can do 20,000. So if you have to commit regularly, try to load and commit in batches, do multi-row inserts, etc. Because COPY only does one commit at the end, commit costs are negligible. But this also means COPY can't recover from errors partway through the data; it undoes the whole bulk load.
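If your driver does default to autocommit, the fix is to turn it off and commit in batches. A sketch with psycopg2 (the 10,000-row batch size is an arbitrary illustration):

conn.autocommit = False     # one transaction, not one commit (fsync) per row

for i, d in enumerate(data, start=1):
    cursor.execute(query, d)
    if i % 10000 == 0:      # pay for one fsync per 10k rows, not per row
        conn.commit()
conn.commit()               # flush the final partial batch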

