Improve performance of mysql LOAD DATA / mysqlimport?

Question

I'm batch-loading a 15 GB CSV (30 million rows) into a MySQL 8 database.

Problem: the task takes about 20 minutes, with an approximate throughput of 15-20 MB/s, while the hard drive is capable of transferring files at 150 MB/s.

I have a RAM disk of 20 GB, which holds my csv. Import as follows:

mysqlimport --user="root" --password="pass" --local --use-threads=8 mytable /tmp/mydata.csv

This uses LOAD DATA under the hood. My target table does not have any indexes, but it has approx 100 columns (I cannot change this).
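
For reference, mysqlimport derives the table name from the CSV file's base name, so, if I read the invocation above correctly (mytable as the database, mydata as the table; neither is stated explicitly), it should issue roughly:

LOAD DATA LOCAL INFILE '/tmp/mydata.csv' INTO TABLE mydata;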

What is strange: I tried tweaking several config parameters as follows in /etc/mysql/my.cnf, but they did not give any significant improvement:

log_bin=OFF                        # intended to disable binary logging
skip-log-bin                       # the documented way to disable the binary log in MySQL 8
innodb_buffer_pool_size=20G        # InnoDB page cache
tmp_table_size=20G                 # max size of in-memory temporary tables
max_heap_table_size=20G            # max size of MEMORY tables
innodb_log_buffer_size=4M          # redo log buffer
innodb_flush_log_at_trx_commit=2   # write redo log at commit, fsync about once per second
innodb_doublewrite=0               # doublewrite buffer disabled
innodb_autoinc_lock_mode=2         # interleaved auto-increment locking

Question: does LOAD DATA / mysqlimport respect those config changes? Or does it bypass them? Or did I use the correct configuration file at all?

At least a SELECT on the variables shows they were correctly loaded by the MySQL server. For example, SHOW VARIABLES LIKE 'innodb_doublewrite' shows OFF.
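
To check the whole set at once rather than one variable at a time, something like this should also work (a convenience query, not part of the original post):

SHOW GLOBAL VARIABLES
WHERE Variable_name IN ('innodb_buffer_pool_size', 'innodb_doublewrite',
                        'innodb_flush_log_at_trx_commit', 'innodb_autoinc_lock_mode');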

Anyway, how could I improve import speed further? Or is my database the bottleneck and there is no way to overcome the 15-20 MB/s threshold?

Update: Interestingly, if I import my csv from the hard drive instead of the RAM disk, performance is almost the same (just a little bit better, but never over 25 MB/s). I also tested the same number of rows, but with only a few (5) columns, and there I get about 80 MB/s. So is the number of columns clearly the bottleneck? And why do more columns slow down the process?

Answer

This could be normal. Let's walk through what is being done:

  • The csv file is being read from a RAM disk, so no IOPs are being used for the read.
  • Are you using InnoDB? If so, the data goes into the buffer_pool. As blocks are built there, they are marked 'dirty' for eventual flushing to disk.
  • Since the buffer_pool is large, but probably not as large as the table will become, some of the blocks will need to be flushed before it finishes reading all the data.
  • After all the data is read, and the table is finished, the dirty blocks are gradually flushed to disk (see the sketch after this list for one way to watch this).
  • If you had non-unique indexes, they would similarly be written to disk in a delayed manner (cf. 'change buffering'). The change buffer, by default, occupies 25% of the buffer_pool.
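
One rough way to watch the flush backlog described above (these are standard InnoDB status counters; the interpretation is a heuristic, not part of the original answer):

SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';  -- blocks still waiting to be flushed
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_total';  -- total pages in the buffer pool
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free';    -- non-zero means the load had to wait for free pages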

How large is the resulting table? It may be significantly larger, or even smaller, than the 15 GB of the csv file.
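
A quick way to check, assuming the table ended up as mydata in schema mytable (names inferred from the mysqlimport command; the InnoDB figures are estimates):

SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'mytable' AND table_name = 'mydata';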

How much time did it take to bring the csv file into the RAM disk? I proffer that that was wasted time and it should have been read from disk while doing the LOAD DATA; that I/O can be overlapped.

Please run SHOW GLOBAL VARIABLES LIKE 'innodb%'; there are several other settings that may be relevant.

More

These are terrible:

tmp_table_size=20G
max_heap_table_size=20G

If you have a complex query, 20 GB could be allocated in RAM, possibly multiple times! Keep those to under 1% of RAM.
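
For illustration only (the machine's RAM size is not stated in the question, so the values below are placeholders; both variables are dynamic and are normally kept in sync):

SET GLOBAL tmp_table_size      = 268435456;   -- 256 MB, roughly 1% of a 32 GB machine
SET GLOBAL max_heap_table_size = 268435456;   -- keep equal to tmp_table_size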

If copying the csv from hard disk to RAM disk runs slowly, I would suspect the validity of the 150 MB/s figure.

If you are loading the table once every 6 hours, and it takes 1/3 of an hour to perform, I don't see the urgency of making it faster. OTOH, there may be something worth looking into. If that 20 minutes is downtime due to the table being locked, that can be easily eliminated:

CREATE TABLE t LIKE real_table;
LOAD DATA INFILE '...' INTO TABLE t;    -- not blocking anyone
RENAME TABLE real_table TO old, t TO real_table;  -- atomic; fast
DROP TABLE old;
