Python:将多进程插入MySQL的代码有什么问题? [英] Python : What's wrong with my code of multi processes inserting to MySQL?

查看:318
本文介绍了Python:将多进程插入MySQL的代码有什么问题?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我编写了一个Python脚本,将一些数据(3亿个)插入到MySQL表中:

I've write a Python script to insert some data(300 millions) to a MySQL table:

#!/usr/bin/python

import os
import MySQLdb

from multiprocessing import Pool

class DB(object):

  def __init__(self):
    self.conn = MySQLdb.connect(host='localhost',user='root',passwd='xxx',db='xdd',port=3306)
    self.cur = self.conn.cursor()

  def insert(self, arr):
    self.cur.execute('insert into RAW_DATA values(null,%s,%s,%s,%s,%s,%s,%s)', arr)

  def close(self):
    self.conn.commit()
    self.cur.close()
    self.conn.close()


def Import(fname):
  db = DB()

  print 'importing ', fname
  with open('data/'+fname, 'r') as f:

    for line in f:
      arr = line.split()
      db.insert(arr)

  db.close()


if  __name__ == '__main__':
  # 800+ files
  files = [d for d in os.listdir('data') if d[-3:]=='txt']

  pool = Pool(processes = 10)
  pool.map(Import, files)

问题是,脚本运行非常非常慢,使用多处理程序是否存在明显的错误?

The problem is, the script runs very very slow, is there any obvious wrong using of multiprocessing ?

推荐答案

是的,如果要向同一表中批量插入3亿行,则不应尝试并行化此插入.所有插入都必须经历相同的瓶颈:更新索引,并写入硬盘上的物理文件.这些操作需要对基础资源(索引或磁盘头)的独占访问权.

Yes, if you are bulk-inserting 300 million rows into the same table, then you shouldn't try to parallelize this insertion. All inserts must go through the same bottlenecks: updating the index, and writing into the physical file on the hard disk. These operations require exclusive access to the underlying resources (the index, or the disk head).

您实际上在数据库上增加了一些无用的开销,该数据库现在必须处理多个并发事务.这会消耗内存,强制进行上下文切换,使磁盘读取头一直在跳动,等等.

You are actually adding some useless overhead on the database which now has to handle several concurrent transactions. This consumes memory, forces context switching, makes the disk read head jump around all the time, and so on.

将所有内容插入同一线程.

Insert everything in the same thread.

看起来您实际上是从某种CSV文件导入数据.您可能要使用内置的 LOAD DATA INFILE MySQL命令,就是为此目的而设计的.如果需要一些有关调整此命令的帮助,请描述您的源文件.

It looks like you are actually importing data from a kind of CSV file. You may want to use the built-in LOAD DATA INFILE MySQL command, designed for this very purpose. Please describe your source file if you need some help in tuning this command.

这篇关于Python:将多进程插入MySQL的代码有什么问题?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆