How to speed up insertion performance in PostgreSQL

Problem description

I am testing Postgres insertion performance. I have a table with one column with number as its data type. There is an index on it as well. I filled the database up using this query:

insert into aNumber (id) values (564),(43536),(34560) ...

I inserted 4 million rows very quickly, 10,000 at a time, with the query above. After the database reached 6 million rows, performance drastically declined to 1 million rows every 15 minutes. Is there any trick to increase insertion performance? I need optimal insertion performance on this project.

Using Windows 7 Pro on a machine with 5 GB RAM.

Recommended answer

See populate a database in the PostgreSQL manual, depesz's excellent article on the topic, and this question.

(Note that this answer is about bulk-loading data into an existing DB or creating a new one. If you're interested in DB restore performance with pg_restore or psql execution of pg_dump output, much of this doesn't apply, since pg_dump and pg_restore already do things like creating triggers and indexes after they finish a schema+data restore.)

There's lots to be done. The ideal solution would be to import into an UNLOGGED table without indexes, then change it to logged and add the indexes. Unfortunately in PostgreSQL 9.4 there's no support for changing tables from UNLOGGED to logged. 9.5 adds ALTER TABLE ... SET LOGGED to permit you to do this.
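
A minimal sketch of that approach, using the question's aNumber table (the numeric column type is an assumption; ALTER TABLE ... SET LOGGED needs PostgreSQL 9.5 or later):

-- No WAL is written for an unlogged table's data, so loading is much cheaper.
CREATE UNLOGGED TABLE aNumber (id numeric);

-- ... bulk-load the data here, then add indexes ...

-- 9.5+ only: make the table crash-safe once the load has finished.
ALTER TABLE aNumber SET LOGGED;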

If you can take your database offline for the bulk import, use pg_bulkload.

Otherwise:

Disable any triggers on the table.
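
For example (DISABLE TRIGGER ALL also turns off the internal triggers that enforce foreign keys and needs superuser rights; DISABLE TRIGGER USER leaves those alone):

ALTER TABLE aNumber DISABLE TRIGGER ALL;

-- ... bulk-load the data here ...

ALTER TABLE aNumber ENABLE TRIGGER ALL;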

Drop indexes before starting the import, re-create them afterwards. (It takes much less time to build an index in one pass than it does to add the same data to it progressively, and the resulting index is much more compact).
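
A sketch, with a hypothetical index name:

DROP INDEX IF EXISTS aNumber_id_idx;

-- ... bulk-load the data here ...

CREATE INDEX aNumber_id_idx ON aNumber (id);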

If doing the import within a single transaction, it's safe to drop foreign key constraints, do the import, and re-create the constraints before committing. Do not do this if the import is split across multiple transactions as you might introduce invalid data.
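
A sketch of that pattern, with hypothetical table, column and constraint names:

BEGIN;

ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;

-- ... bulk-load into orders here ...

-- Re-adding the constraint re-checks every row before the COMMIT makes it visible.
ALTER TABLE orders
    ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id);

COMMIT;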

If possible, use COPY instead of INSERTs.
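
For example, to load the id column from a CSV file (the file names are illustrative; server-side COPY reads the file as the server process, while psql's \copy streams it from the client):

-- Server-side: the file must be readable by the PostgreSQL server process.
COPY aNumber (id) FROM '/tmp/ids.csv' WITH (FORMAT csv);

-- Client-side alternative from psql:
\copy aNumber (id) from 'ids.csv' with (format csv)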

If you can't use COPY, consider using multi-valued INSERTs if practical. You seem to be doing this already. Don't try to list too many values in a single VALUES though; those values have to fit in memory a couple of times over, so keep it to a few hundred per statement.

Batch your inserts into explicit transactions, doing hundreds of thousands or millions of inserts per transaction. There's no practical limit AFAIK, but batching will let you recover from an error by marking the start of each batch in your input data. Again, you seem to be doing this already.
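
In its simplest form that just means wrapping each chunk of multi-row INSERTs in an explicit BEGIN/COMMIT, for instance:

BEGIN;
insert into aNumber (id) values (564),(43536),(34560);
-- ... many more multi-row INSERTs for this batch ...
COMMIT;

BEGIN;
-- ... next batch ...
COMMIT;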

Use synchronous_commit=off and a huge commit_delay to reduce fsync() costs. This won't help much if you've batched your work into big transactions, though.
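
Both settings can be changed just for the loading session, so the rest of the cluster keeps normal durability; a crash can lose the last few seemingly committed transactions but cannot corrupt data. The values are only illustrative:

SET synchronous_commit = off;   -- COMMIT no longer waits for the WAL flush
SET commit_delay = 100000;      -- microseconds; changing it may require superuser

-- ... run the bulk inserts in this session ...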

INSERT or COPY in parallel from several connections. How many depends on your hardware's disk subsystem; as a rule of thumb, you want one connection per physical hard drive if using direct attached storage.

Set a high max_wal_size value (checkpoint_segments in older versions) and enable log_checkpoints. Look at the PostgreSQL logs and make sure it's not complaining about checkpoints occurring too frequently.
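
On 9.5 and later this can be done with ALTER SYSTEM followed by a configuration reload (the 8GB figure is only an example; older versions set checkpoint_segments in postgresql.conf instead):

ALTER SYSTEM SET max_wal_size = '8GB';
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();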

If and only if you don't mind losing your entire PostgreSQL cluster (your database and any others on the same cluster) to catastrophic corruption if the system crashes during the import, you can stop Pg, set fsync=off, start Pg, do your import, then (vitally) stop Pg and set fsync=on again. See WAL configuration. Do not do this if there is already any data you care about in any database on your PostgreSQL install. If you set fsync=off you can also set full_page_writes=off; again, just remember to turn it back on after your import to prevent database corruption and data loss. See non-durable settings in the Pg manual.
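
The advice above describes flipping these in postgresql.conf around a stop/start; on 9.4 and later an equivalent, equally dangerous sketch using ALTER SYSTEM is:

-- DANGER: a crash while fsync is off can corrupt the entire cluster.
ALTER SYSTEM SET fsync = off;
ALTER SYSTEM SET full_page_writes = off;
SELECT pg_reload_conf();

-- ... do the import ...

ALTER SYSTEM SET fsync = on;
ALTER SYSTEM SET full_page_writes = on;
SELECT pg_reload_conf();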

You should also look at tuning your system:

Use good quality SSDs for storage as much as possible. Good SSDs with reliable, power-protected write-back caches make commit rates incredibly faster. They're less beneficial when you follow the advice above - which reduces disk flushes / number of fsync()s - but can still be a big help. Do not use cheap SSDs without proper power-failure protection unless you don't care about keeping your data.

If you're using RAID 5 or RAID 6 for direct attached storage, stop now. Back your data up, restructure your RAID array to RAID 10, and try again. RAID 5/6 are hopeless for bulk write performance - though a good RAID controller with a big cache can help.

If you have the option of using a hardware RAID controller with a big battery-backed write-back cache, this can really improve write performance for workloads with lots of commits. It doesn't help as much if you're using async commit with a commit_delay or if you're doing fewer big transactions during bulk loading.

If possible, store WAL (pg_wal, or pg_xlog in old versions) on a separate disk / disk array. There's little point in using a separate filesystem on the same disk. People often choose to use a RAID1 pair for WAL. Again, this has more effect on systems with high commit rates, and it has little effect if you're using an unlogged table as the data load target.

You may also be interested in Optimise PostgreSQL for fast testing.
