How to speed up insertion performance in PostgreSQL


Question

I am testing Postgres insertion performance. I have a table with one column with number as its data type. There is an index on it as well. I filled the database up using this query:

insert into aNumber (id) values (564),(43536),(34560) ...

I inserted 4 million rows very quickly, 10,000 at a time, with the query above. After the database reached 6 million rows, performance drastically declined to 1 million rows every 15 minutes. Is there any trick to increase insertion performance? I need optimal insertion performance on this project.

Using Windows 7 Pro on a machine with 5 GB RAM.

Answer

See Populating a Database in the PostgreSQL manual, as well as this SO question.

(Note that this answer is about bulk-loading data into an existing DB or creating a new one. If you're interested in DB restore performance with pg_restore or psql execution of pg_dump output, much of this doesn't apply, since pg_dump and pg_restore already do things like creating triggers and indexes after they finish the schema + data restore.)

There's lots to be done. The ideal solution would be to import into an UNLOGGED table without indexes, then change it to logged and add the indexes. Unfortunately, PostgreSQL 9.4 has no support for changing tables from UNLOGGED to logged; 9.5 adds ALTER TABLE ... SET LOGGED to permit you to do this.
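
A minimal sketch of that approach, using the question's aNumber table (the column type and load step are assumptions, and ALTER TABLE ... SET LOGGED needs 9.5 or later):

CREATE UNLOGGED TABLE aNumber (id numeric);    -- no WAL written, no indexes yet
-- ... bulk load the data here (COPY or batched INSERTs) ...
ALTER TABLE aNumber SET LOGGED;                -- 9.5+ only: make the table crash-safe again
CREATE INDEX ON aNumber (id);                  -- build the index in a single pass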

If you can take your database offline for the bulk import, use pg_bulkload.

Otherwise:

Disable any triggers on the table.
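
For example (a sketch using the question's aNumber table; note that DISABLE TRIGGER ALL also disables the internal triggers that enforce foreign keys, and may require superuser rights):

ALTER TABLE aNumber DISABLE TRIGGER ALL;
-- ... run the bulk load ...
ALTER TABLE aNumber ENABLE TRIGGER ALL;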

Drop indexes before starting the import and re-create them afterwards. (It takes much less time to build an index in one pass than it does to add the same data to it progressively, and the resulting index is much more compact.)
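
A sketch, with a hypothetical index name since the question doesn't give one:

DROP INDEX IF EXISTS aNumber_id_idx;
-- ... run the bulk load ...
CREATE INDEX aNumber_id_idx ON aNumber (id);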

If doing the import within a single transaction, it's safe to drop foreign key constraints, do the import, and re-create the constraints before committing. Do not do this if the import is split across multiple transactions, as you might introduce invalid data.
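
A sketch with hypothetical table and constraint names, keeping everything inside one transaction as described:

BEGIN;
ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
-- ... bulk load into orders ...
ALTER TABLE orders ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id);
COMMIT;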

If possible, use COPY instead of INSERTs.
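
For example (the file path is illustrative; server-side COPY reads a file visible to the PostgreSQL server process, while psql's \copy or COPY ... FROM STDIN streams data from the client instead):

COPY aNumber (id) FROM '/path/to/ids.csv' WITH (FORMAT csv);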

If you can't use COPY, consider using multi-valued INSERTs if practical. You seem to be doing this already. Don't try to list too many values in a single VALUES though; those values have to fit in memory a couple of times over, so keep it to a few hundred per statement.

Batch your inserts into explicit transactions, doing hundreds of thousands or millions of inserts per transaction. There's no practical limit AFAIK, but batching will let you recover from an error by marking the start of each batch in your input data. Again, you seem to be doing this already.
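
A sketch of explicit batching, reusing the question's style of multi-valued INSERTs (values are illustrative):

BEGIN;
INSERT INTO aNumber (id) VALUES (564), (43536), (34560) /* ... a few hundred values ... */;
INSERT INTO aNumber (id) VALUES (78901), (23455) /* ... */;
-- ... repeat until the batch is done ...
COMMIT;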

Use synchronous_commit=off and a huge commit_delay to reduce fsync() costs. This won't help much if you've batched your work into big transactions, though.
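
Both can be set per session, for example (commit_delay is in microseconds, and changing it may require elevated privileges on newer versions):

SET synchronous_commit = off;
SET commit_delay = 100000;   -- 100 ms, the maximum allowed value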

INSERT or COPY in parallel from several connections. How many depends on your hardware's disk subsystem; as a rule of thumb, you want one connection per physical hard drive if using direct attached storage.

Set a high checkpoint_segments value and enable log_checkpoints. Look at the PostgreSQL logs and make sure it's not complaining about checkpoints occurring too frequently.
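
For example, with illustrative values (ALTER SYSTEM needs 9.4+; on 9.5 and later checkpoint_segments was replaced by max_wal_size):

ALTER SYSTEM SET checkpoint_segments = 64;
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();    -- apply the changes without a restart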

If and only if you don't mind losing your entire PostgreSQL cluster (your database and any others on the same cluster) to catastrophic corruption if the system crashes during the import, you can stop Pg, set fsync=off, start Pg, do your import, then (vitally) stop Pg and set fsync=on again. See WAL configuration. Do not do this if there is already any data you care about in any database on your PostgreSQL install. If you set fsync=off you can also set full_page_writes=off; again, just remember to turn them back on after your import to prevent database corruption and data loss. See non-durable settings in the Pg manual.
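
If you do accept that risk, the settings involved are just the following two; the answer itself recommends stopping Pg and editing postgresql.conf, and this ALTER SYSTEM (9.4+) variant is only a sketch of the same idea. Either way, both must go back on afterwards:

ALTER SYSTEM SET fsync = off;                -- DANGEROUS: risks the entire cluster on a crash
ALTER SYSTEM SET full_page_writes = off;
SELECT pg_reload_conf();
-- ... run the import ...
ALTER SYSTEM SET fsync = on;
ALTER SYSTEM SET full_page_writes = on;
SELECT pg_reload_conf();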

You should also look at tuning your system:

Use good quality SSDs for storage as much as possible. Good SSDs with reliable, power-protected write-back caches make commit rates incredibly faster. They're less beneficial when you follow the advice above - which reduces disk flushes / the number of fsync()s - but they can still be a big help. Do not use cheap SSDs without proper power-failure protection unless you don't care about keeping your data.

If you're using RAID 5 or RAID 6 for direct attached storage, stop now. Back your data up, restructure your RAID array to RAID 10, and try again. RAID 5/6 are hopeless for bulk write performance - though a good RAID controller with a big cache can help.

If you have the option of using a hardware RAID controller with a big battery-backed write-back cache, this can really improve write performance for workloads with lots of commits. It doesn't help as much if you're using async commit with a commit_delay, or if you're doing fewer big transactions during bulk loading.

If possible, store WAL (pg_xlog) on a separate disk / disk array. There's little point in using a separate filesystem on the same disk. People often choose to use a RAID 1 pair for WAL. Again, this has more effect on systems with high commit rates, and it has little effect if you're using an unlogged table as the data load target.

You may also be interested in Optimise PostgreSQL for fast testing.

