Best practice of bulk_create for massive records


Problem Description

I use bulk_create to insert 1 million records into a new table. It takes 80 seconds. Django only uses one CPU core (roughly 25% CPU, but no core reaches 100%). I believe there is room for improvement.

Here is the code:

from django.db import models

class Stock(models.Model):
    code = models.CharField(max_length=6, unique=True)
    name = models.CharField(max_length=8)

class Watcher(models.Model):
    date = models.DateField(db_index=True)
    stock = models.ForeignKey(Stock, on_delete=models.CASCADE, unique_for_date="date")
    num = models.PositiveIntegerField(default=0)

batch = []
for input in inputs:
    # get_stock() looks up the related Stock instance for this row
    watcher = Watcher(date=input['date'], stock=get_stock(), num=input['num'])
    batch.append(watcher)
Watcher.objects.bulk_create(batch)

I tried several things:

  1. Use a proper batch_size. The best value for me is roughly 4000; it takes 80-90 seconds (see the sketch after this list).
  2. Use a ThreadPool(). It is much slower, roughly 120-140 seconds.
  3. Remove the index on the DateField. A bit slower than 1.
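For item 1, a minimal sketch of what passing batch_size looks like; the list comprehension is just shorthand for the loop above, and 4000 is simply the value that worked best for me:

# Sketch of option 1: let bulk_create split the INSERT into chunks of 4000 rows
batch = [Watcher(date=i['date'], stock=get_stock(), num=i['num']) for i in inputs]
Watcher.objects.bulk_create(batch, batch_size=4000)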

I'm using MySQL 5.6 Community Edition. The storage engine is MyISAM. Here is the config:

[mysqld]
server-id   = 1
default-storage-engine = MyISAM
port        = 3306
socket      = /tmp/mysql.sock
datadir     = "{MYSQLDATAPATH}"
skip-external-locking
explicit_defaults_for_timestamp = TRUE

# MyISAM #
key-buffer-size = 32M
max_allowed_packet = 16M

# CACHES AND LIMITS #
tmp-table-size                 = 32M
max-heap-table-size            = 32M
query-cache-type               = 0
query-cache-size               = 0
max-connections                = 500
thread-cache-size              = 50
open-files-limit               = 65535
table-definition-cache         = 1024
table-open-cache               = 2048

# LOGGING
log-bin       = mysql-bin
binlog_format = mixed

When I import another table (similar structure, same indices, but with 9 columns), it takes 15 minutes. The time increase is not linear.

Is anything wrong with bulk_create?

Although I've accepted the answer, I think I should understand the mystery, so I did some more tests. I found that Django's model instantiation is the root cause of the slowdown: when I have 800,000 records, creating a model instance 800,000 times is very time-consuming.

The ORM framework does a lot of internal work that we do not see, for instance integrity checks. In my case, the records are being imported into an empty database table, so those checks are unnecessary.

Now I use cursor.executemany(), which shortens the insertion time of 800,000 4-column records from 54 seconds to 16 seconds, and the insertion time of 800,000 13-column records from 13 minutes to 46 seconds.

According to my experiments, you should call executemany for every 3,000-5,000 records. I tried passing all 800k records in one call, and it was extremely slow.
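A rough sketch of how the executemany() calls can be chunked, assuming the default table and column names Django would generate for the Watcher model above (myapp_watcher is a placeholder; substitute your real app label) and rows given as (date, stock_id, num) tuples:

from django.db import connection

def insert_watchers(rows, chunk_size=5000):
    # rows: list of (date, stock_id, num) tuples; the table/column names assume
    # the default schema Django generates for the Watcher model above.
    sql = "INSERT INTO myapp_watcher (date, stock_id, num) VALUES (%s, %s, %s)"
    with connection.cursor() as cursor:
        for start in range(0, len(rows), chunk_size):
            cursor.executemany(sql, rows[start:start + chunk_size])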

Recommended Answer

While bulk_create is useful for saving a small number of records while processing an HTML form, it's not ideally suited for saving thousands of records. As you have found out already, it's slow because it needs a lot of memory and sends a very large query to the database. Fortunately, LOAD DATA INFILE comes to the rescue.


The LOAD DATA INFILE statement reads rows from a text file into a table at a very high speed. LOAD DATA INFILE is the complement of SELECT ... INTO OUTFILE.

We can produce a suitable file by using Python's csv writer; the following example is from the documentation.

import csv

with open('some.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)

Finally, as you have already found out, the LOCAL option to LOAD DATA INFILE can sometimes be used for convenience.


LOCAL works only if your server and your client both have been configured to permit it.

When using this option, the file does not need to be manually transferred to the server. You can generate the CSV file on the client side, and the LOCAL option will cause the mysql client to automatically transfer the file to the server.
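For example, a rough sketch of issuing the statement from Django; the table name myapp_watcher and the column list are assumptions based on the question's model, and the MySQL client library must also have local_infile enabled:

from django.db import connection

# Assumes some.csv contains "date,stock_id,num" rows matching the Watcher table.
with connection.cursor() as cursor:
    cursor.execute(
        "LOAD DATA LOCAL INFILE 'some.csv' "
        "INTO TABLE myapp_watcher "
        "FIELDS TERMINATED BY ',' "
        "(date, stock_id, num)"
    )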


If LOCAL is not specified, the file must be located on the server host and is read directly by the server.
