MySQL INSERT INTO ... SELECT query is too slow to copy 100 million rows


Problem description

I have one table consisting of 100+ million rows, and I want to copy the data into another table. I have one requirement: the query execution must not block other operations on these database tables. I have written a stored procedure as follows.

I count the number of rows in the source table, then loop, copying 10,000 rows in each iteration: start a transaction, commit it, then read the next 10,000 rows by offset.

DELIMITER $$
CREATE PROCEDURE insert_data()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE iterations INT DEFAULT 0;
  DECLARE rowOffset INT DEFAULT 0;
  DECLARE limitSize INT DEFAULT 10000;
  -- Number of 10,000-row chunks needed to cover the source table
  SET iterations = (SELECT COUNT(*) FROM Table1) / 10000;

  WHILE i <= iterations DO
    START TRANSACTION;
        -- Copy one chunk; IGNORE skips rows already present in Table2
        INSERT IGNORE INTO Table2(id, field2, field3)
            SELECT f1, f2, f3
            FROM Table1
            ORDER BY id ASC
            LIMIT limitSize OFFSET rowOffset;
    COMMIT;
    SET i = i + 1;
    SET rowOffset = rowOffset + limitSize;
  END WHILE;
END$$
DELIMITER ;

The query executes without locking the tables, but after copying a few million rows it has become too slow. Please suggest a better way to do this task. Thank you!

Recommended answer

Any INSERT ... SELECT ... query does acquire a SHARED lock on the rows it reads from the source table in the SELECT. But by processing smaller chunks of rows, the lock doesn't last too long.

The query with LIMIT ... OFFSET is going to get slower and slower as you advance through the source table. At 10,000 rows per chunk, you need to run that query 10,000 times, and each one has to start over and scan through the table to reach the new OFFSET.

No matter what you do, copying 100 million rows is going to take a while. It's doing a lot of work.

I would use pt-archiver, a free tool designed for this purpose. It processes the rows in "chunks" (or subsets). It will dynamically adjust the size of the chunks so that each chunk takes 0.5 seconds.
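
For example, a copy-only run that leaves the source table untouched could look roughly like the sketch below. The host, database name, and option values are placeholders, not something from the question; adjust them to your setup:

pt-archiver \
  --source h=localhost,D=mydb,t=Table1 \
  --dest   h=localhost,D=mydb,t=Table2 \
  --where "1=1" \
  --limit 10000 --txn-size 10000 \
  --no-delete --progress 100000 --statistics

Here --where "1=1" tells pt-archiver to copy every row, --no-delete keeps the rows in Table1, and --limit / --txn-size control how many rows are handled per statement and per transaction.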

The biggest difference between your method and pt-archiver is that pt-archiver doesn't use LIMIT ... OFFSET; it walks along the primary key index, selecting chunks of rows by value instead of by position. So every chunk is read more efficiently.

Re your comment:

I expect that making the batch size smaller — and increasing the number of iterations — will make the performance problem worse, not better.

The reason is that when you use LIMIT with OFFSET, every query has to start over at the start of the table, and count the rows up to the OFFSET value. This gets longer and longer as you iterate through the table.

Running 20,000 expensive queries using OFFSET will take longer than running 10,000 similar queries. The most expensive part will not be reading 5,000 or 10,000 rows, or inserting them into the destination table. The expensive part will be skipping through ~50,000,000 rows, over and over again.
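
A rough back-of-the-envelope estimate (assuming about 100 million rows and counting only the rows skipped to reach each OFFSET) shows the scale:

  100,000,000 rows / 10,000 rows per chunk  =  10,000 queries
  average OFFSET per query                  ≈  50,000,000 rows
  total rows skipped                        ≈  10,000 * 50,000,000  =  5 * 10^11

That is thousands of times more rows visited than the 10^8 rows actually copied, and halving the chunk size doubles the number of queries while leaving the average offset the same, so the skipped-row total doubles too.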

Instead, you should iterate over the table by values, not by offsets.

INSERT IGNORE INTO Table2(id, field2, field3)
    SELECT f1, f2, f3
    FROM Table1
    -- BETWEEN is inclusive on both ends, so consecutive chunks share one
    -- boundary row; INSERT IGNORE skips that duplicate (assuming id is
    -- unique in Table2)
    WHERE id BETWEEN rowOffset AND rowOffset+limitSize;

Before the loop, query the MIN(id) and MAX(id), and start rowOffset at the min value, and loop up to the max value.

This is the way pt-archiver works.
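
Putting the pieces together, a revised version of the procedure could look roughly like the following. This is only a sketch: it assumes id is an integer primary key on Table1 (so each range of id values can be located through the index) and it reuses the table and column names from the question.

DELIMITER $$
CREATE PROCEDURE insert_data_by_key()
BEGIN
  DECLARE minId BIGINT DEFAULT 0;
  DECLARE maxId BIGINT DEFAULT 0;
  DECLARE chunkSize INT DEFAULT 10000;
  DECLARE fromId BIGINT DEFAULT 0;

  -- Find the key range once, before the loop
  SELECT MIN(id), MAX(id) INTO minId, maxId FROM Table1;
  SET fromId = minId;

  WHILE fromId <= maxId DO
    START TRANSACTION;
        -- Each chunk is located by a primary-key range, not by OFFSET,
        -- so MySQL can seek directly to the first row of the chunk
        INSERT IGNORE INTO Table2(id, field2, field3)
            SELECT f1, f2, f3
            FROM Table1
            WHERE id >= fromId AND id < fromId + chunkSize;
    COMMIT;
    SET fromId = fromId + chunkSize;
  END WHILE;
END$$
DELIMITER ;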
