What is the best way to load a massive amount of data into PostgreSQL?


Problem description

I want to load a massive amount of data into PostgreSQL. Do you know any other "tricks" apart from the ones mentioned in the PostgreSQL documentation?

What have I done so far?

1) set the following parameters in postgresql.conf (for 64 GB of RAM):

    shared_buffers = 26GB
    work_mem = 40GB
    maintenance_work_mem = 10GB         # min 1MB, default: 16MB
    effective_cache_size = 48GB
    max_wal_senders = 0                 # max number of walsender processes
    wal_level = minimal                 # minimal, archive, or hot_standby
    synchronous_commit = off            # only while the system does nothing but load data (other client updates could be lost!)
    archive_mode = off                  # allows archiving to be done
    autovacuum = off                    # enable autovacuum subprocess? 'on'
    checkpoint_segments = 256           # in logfile segments, min 1, 16MB each; default = 3; 256 = write every 4 GB
    checkpoint_timeout = 30min          # range 30s-1h, default = 5min
    checkpoint_completion_target = 0.9  # checkpoint target duration, 0.0 - 1.0
    checkpoint_warning = 0              # 0 disables, default = 30s
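
Once postgresql.conf has been edited and the server reloaded or restarted, it is worth confirming that the values actually took effect. A minimal check from a psql session, using only the parameter names listed above:

    -- verify the bulk-load settings from a live session
    SHOW shared_buffers;
    SHOW wal_level;
    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('work_mem', 'maintenance_work_mem', 'synchronous_commit', 'autovacuum');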

2) transactions (autocommit disabled) + the isolation level set as low as possible (repeatable read). I create a new table and load data into it in the same transaction (as sketched below, after point 3).

3) COPY commands set to run in a single transaction (supposedly this is the fastest approach to COPY data).
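
A minimal sketch of points 2 and 3 combined; the table name big_table, its columns, and the file path /data/big_table.csv are hypothetical placeholders, not taken from the original post:

    -- one explicit transaction (autocommit off) with the requested isolation level
    BEGIN ISOLATION LEVEL REPEATABLE READ;

    -- the target table is created in the same transaction that loads it
    CREATE TABLE big_table (
        id      bigint,
        payload text
    );

    -- bulk load with COPY inside that single transaction
    COPY big_table (id, payload)
        FROM '/data/big_table.csv' WITH (FORMAT csv);

    COMMIT;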

5) disabled autovacuum (so statistics are not regenerated after every 50 new rows added).
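
With autovacuum switched off, planner statistics have to be refreshed by hand once the load has finished. A minimal sketch, reusing the hypothetical big_table from the sketch above:

    -- statistics are not maintained while autovacuum is off,
    -- so refresh them manually after the bulk load
    ANALYZE big_table;

    -- optionally reclaim space and set visibility information in the same pass
    VACUUM ANALYZE big_table;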

6) COPY FREEZE: it does not speed up the import itself, but it makes operations after the import faster.
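
COPY ... FREEZE is only accepted when the target table was created or truncated in the current transaction; a minimal sketch under that assumption, with the same hypothetical names as above:

    BEGIN;
    -- FREEZE requires the table to have been created or truncated in this transaction
    TRUNCATE big_table;
    COPY big_table (id, payload)
        FROM '/data/big_table.csv' WITH (FORMAT csv, FREEZE);
    COMMIT;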

Do you have any other recommendations, or do you perhaps disagree with the settings above?

Recommended answer

Do NOT use indexes, except for a unique index on a single numeric key.

That doesn't fit with all the DB theory we were taught, but testing with heavy loads of data demonstrates it. Here is the result of loading 100M rows at a time until reaching 2 billion rows in a table, with a bunch of various queries run against the resulting table each time. The first graphic is with a 10 gigabit NAS (150 MB/s), the second with 4 SSDs in RAID 0 (R/W @ 2 GB/s).

If you have more than 200 million rows in a table on regular disks, it's faster if you forget indexes. On SSDs, the limit is around 1 billion.
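
If indexes are needed for the queries that follow the load, a common complement to this advice (and one of the tricks in the PostgreSQL documentation on populating a database) is to create them only after the data is in place; a minimal sketch, again with hypothetical names:

    -- drop secondary indexes before the bulk load ...
    DROP INDEX IF EXISTS big_table_payload_idx;

    -- ... run the COPY as shown earlier ...

    -- ... then build the index once, over the fully loaded table
    CREATE INDEX big_table_payload_idx ON big_table (payload);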

I've also done it with partitions for better results, but with PG 9.2 it's difficult to benefit from them if you use stored procedures. You also have to take care to write/read only one partition at a time. However, partitions are the way to go to keep your tables below the 1-billion-row wall. They also help a lot with multiprocessing your loads. With SSDs, a single process let me insert (copy) 18,000 rows/s (with some processing work included). With multiprocessing on 6 CPUs, that grows to 80,000 rows/s.
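
In the PG 9.2 era, partitioning meant table inheritance plus CHECK constraints (PostgreSQL 10+ offers declarative PARTITION BY instead). A minimal sketch of that scheme, with hypothetical names and a monthly range chosen purely for illustration:

    -- the parent table holds no data of its own
    CREATE TABLE big_table_parent (
        id         bigint,
        created_at date,
        payload    text
    );

    -- one child table per range; the CHECK constraint enables constraint exclusion
    CREATE TABLE big_table_2024_01 (
        CHECK (created_at >= DATE '2024-01-01' AND created_at < DATE '2024-02-01')
    ) INHERITS (big_table_parent);

    -- each bulk load then targets exactly one child at a time
    COPY big_table_2024_01 (id, created_at, payload)
        FROM '/data/big_table_2024_01.csv' WITH (FORMAT csv);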

Watch your CPU and IO usage while testing, in order to optimize both.
