Optimal chunksize parameter in pandas.DataFrame.to_sql
Question
Working with a large pandas DataFrame that needs to be dumped into a PostgreSQL table. From what I've read, it's not a good idea to dump it all at once (I was locking up the db); instead, use the chunksize parameter. The answers here are helpful for workflow, but I'm asking specifically about how the value of chunksize affects performance.
In [5]: df.shape
Out[5]: (24594591, 4)
In [6]: df.to_sql('existing_table',
                  con=engine,
                  index=False,
                  if_exists='append',
                  chunksize=10000)
Is there a recommended default and is there a difference in performance when setting the parameter higher or lower? Assuming I have the memory to support a larger chunksize, will it execute faster?
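One way to answer this empirically is to time the insert at several chunk sizes. A minimal benchmarking sketch, assuming an in-memory SQLite database as a stand-in for the PostgreSQL engine (the frame and table name `bench_table` here are made up for illustration):

```python
import sqlite3
import time

import pandas as pd

# Small synthetic frame standing in for the real 24M-row DataFrame.
df = pd.DataFrame({"a": range(50_000), "b": 1.0, "c": "x", "d": 2})

for chunksize in (500, 5_000, 50_000):
    con = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL engine
    start = time.perf_counter()
    df.to_sql("bench_table", con, index=False, if_exists="replace",
              chunksize=chunksize)
    elapsed = time.perf_counter() - start
    print(f"chunksize={chunksize}: {elapsed:.3f}s")
    con.close()
```

Against a real PostgreSQL server, each chunk costs a network round-trip, so the differences between chunk sizes will be much larger than in this local sketch.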
Answer
In my case, 3M rows with 5 columns were inserted in 8 minutes when I called pandas to_sql with chunksize=5000 and method='multi'. This was a huge improvement, as inserting 3M rows into the database from Python had been very slow before.
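As a concrete sketch of that call, again using an in-memory SQLite database as a stand-in for the PostgreSQL engine (with a real server you would pass a SQLAlchemy engine from `create_engine` instead; the frame here is a toy 5-column example):

```python
import sqlite3

import pandas as pd

# Toy 5-column frame, mirroring the 3M-row case at small scale.
df = pd.DataFrame({f"col{i}": range(150) for i in range(5)})

con = sqlite3.connect(":memory:")  # stand-in for create_engine("postgresql://...")
df.to_sql(
    "existing_table",
    con,
    index=False,
    if_exists="append",
    chunksize=5000,   # rows per batch
    method="multi",   # one multi-row INSERT per batch instead of row-by-row
)

rows = con.execute("SELECT COUNT(*) FROM existing_table").fetchone()[0]
print(rows)  # 150
```

method='multi' binds every row of a chunk into a single INSERT statement, which cuts per-statement overhead; note that each backend caps the number of bound parameters per statement (rows × columns), so very large chunks can fail with this method.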