How to speed up insertion from pandas.DataFrame.to_sql


Question

Hello, I am currently trying to write data from four pandas DataFrames to MySQL on my local machine. It takes my machine 32 seconds to insert 20,000 records (5,000 per table). Code:

Tables: 1) posts 2) post_stats 3) post_languages 4) post_tags

import time
from sqlalchemy import create_engine

engine = create_engine("mysql+mysqldb://root:dbase@123@localhost/testDb")

startTime = time.time()

dfstat.to_sql('post_stats', con=engine, if_exists='append', index=False)

# Look up the generated id in post_stats for each post, one SELECT per row
for i in range(0, dfp.shape[0]):
    ss = str(dfp.iloc[i][0])
    sss = 'Select id from post_stats where post_id ="%s"' % (ss)
    #print(sss)
    rss = engine.execute(sss)
    x = rss.fetchone()
    dfp['stats_id'][i] = x[0]

dfp.to_sql('posts', con=engine, if_exists='append', index=False)
dfl.to_sql('post_languages', con=engine, if_exists='append', index=False)
dftagv.to_sql('post_tags', con=engine, if_exists='append', index=False)

endTime = time.time()
diff = endTime - startTime
print(diff)

Currently I am storing the data on my local machine, but in the future I will have to send the data over to a MySQL server. Is there any way to speed up the insertion, or a different approach (such as a bulk insert) that would let me store the data at a faster rate? Please suggest.

Answer

The problem here is that an insert query is issued for each row, and before the next row is inserted it waits for the ACK.

Try running this snippet before import pandas as pd:

from pandas.io.sql import SQLTable

def _execute_insert(self, conn, keys, data_iter):
    print("Using monkey-patched _execute_insert")
    # Collect every row as a dict, then send them all in a single statement
    data = [dict((k, v) for k, v in zip(keys, row)) for row in data_iter]
    conn.execute(self.insert_statement().values(data))

SQLTable._execute_insert = _execute_insert
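
A minimal usage sketch, assuming a placeholder connection URL and a small stand-in DataFrame for the question's dfstat (both hypothetical):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical credentials and data; substitute your own
engine = create_engine("mysql+mysqldb://root:password@localhost/testDb")
dfstat = pd.DataFrame({"post_id": ["p1", "p2"], "views": [10, 20]})

# With SQLTable._execute_insert patched above, this issues one
# multi-row INSERT per chunk instead of one INSERT (plus ACK) per row
dfstat.to_sql("post_stats", con=engine, if_exists="append", index=False)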

This is a patch by nhockham to the to_sql insert, which by default inserts row by row. Here's the GitHub issue.

If you can forgo pandas.to_sql, I suggest you try a SQLAlchemy bulk insert, or simply write the multi-row query yourself; a sketch follows below.
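
Here is a minimal sketch of the SQLAlchemy Core route, assuming the post_stats table from the question; the connection URL, column names, and sample data are placeholders:

import pandas as pd
from sqlalchemy import create_engine, MetaData, Table

engine = create_engine("mysql+mysqldb://root:password@localhost/testDb")
metadata = MetaData()

# Reflect the existing table definition from the database
# (SQLAlchemy 1.4+ spelling; on 1.3 use autoload=True, autoload_with=engine)
post_stats = Table("post_stats", metadata, autoload_with=engine)

dfstat = pd.DataFrame({"post_id": ["p1", "p2"], "views": [10, 20]})
rows = dfstat.to_dict(orient="records")  # DataFrame -> list of dicts

with engine.begin() as conn:
    # One INSERT ... VALUES (...), (...) statement covers every row
    conn.execute(post_stats.insert().values(rows))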

To clarify: we are modifying the _execute_insert method of the class SQLTable in pandas.io.sql, so the snippet has to be added to your script before the pandas module is imported.

The last line is the change:

conn.execute(self.insert_statement(), data)

was changed to:

conn.execute(self.insert_statement().values(data))

The first form inserts row by row, while the second inserts all rows in one SQL statement, i.e. a single INSERT ... VALUES (...), (...), ... round trip.

Update: for newer versions of pandas, the above needs a slight modification.

from pandas.io.sql import SQLTable

def _execute_insert(self, conn, keys, data_iter):
    print("Using monkey-patched _execute_insert")
    # Build one row dict per record, then insert them all at once
    data = [dict(zip(keys, row)) for row in data_iter]
    # self.table is the underlying SQLAlchemy Table object
    conn.execute(self.table.insert().values(data))

SQLTable._execute_insert = _execute_insert
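
For reference, pandas 0.24 and later expose this multi-row behavior directly through to_sql's method argument, which can replace the monkey-patch entirely (engine and dfstat as in the sketches above):

# pandas >= 0.24: built-in multi-row inserts, no patching required
dfstat.to_sql("post_stats", con=engine, if_exists="append",
              index=False, method="multi")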
