How to Insert Huge Pandas Dataframe in MySQL table with Parallel Insert Statement?


Problem description

I am working on a project where I have to write a dataframe with millions of rows and about 25 columns, mostly of numeric type, to a MySQL table. I am using the pandas DataFrame.to_sql function to dump the dataframe into the table. I found that this function builds a single INSERT statement that inserts multiple rows at once. This is a good approach, but MySQL limits the length of a query built this way (the max_allowed_packet setting).

Is there a way to run the inserts in parallel against the same table so that I can speed up the process?

Recommended answer

You can do a few things to achieve that.

One way is to pass an additional argument when writing to SQL:

df.to_sql('my_table', engine, method='multi')

According to the documentation, passing 'multi' to the method argument lets to_sql bundle multiple rows into each INSERT statement (bulk insert).
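A minimal sketch of this approach: combining method='multi' with chunksize caps the rows per generated INSERT so each query stays under MySQL's packet limit. The table name is illustrative, and an in-memory SQLite engine stands in for the MySQL connection so the sketch runs anywhere; swap in your own connection URL.

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative engine; replace with e.g. "mysql+pymysql://user:pass@host/db".
engine = create_engine("sqlite://")

df = pd.DataFrame({"a": range(1000), "b": range(1000)})

# method='multi' packs many rows into each INSERT statement;
# chunksize caps the rows per statement so the query length
# stays below MySQL's max_allowed_packet limit.
df.to_sql("my_table", engine, if_exists="append", index=False,
          method="multi", chunksize=200)
```

With one million rows and chunksize=200, this issues 5,000 multi-row INSERTs instead of one million single-row ones.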

Another solution is to construct a custom insert function using multiprocessing.dummy (a thread-based clone of the multiprocessing API). Here is the link to the documentation: https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.dummy

from multiprocessing.dummy import Pool as ThreadPool

...

def insert_df(df, *args, **kwargs):
    nworkers = 4  # number of worker threads running inserts in parallel

    chunksize = df.shape[0] // nworkers  # rows per chunk
    chunks = [(chunksize * i, chunksize * (i + 1)) for i in range(nworkers)]
    chunks.append((chunksize * nworkers, df.shape[0]))  # leftover rows, if any
    pool = ThreadPool(nworkers)

    def worker(bounds):
        i, j = bounds
        df.iloc[i:j, :].to_sql(*args, **kwargs)

    pool.map(worker, chunks)
    pool.close()
    pool.join()

...

insert_df(df, "foo_bar", engine, if_exists='append')

The second method was suggested at https://stackoverflow.com/a/42164138/5614132.
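The chunk arithmetic used by insert_df can be checked in isolation. This standalone sketch (the helper name make_chunks is mine, not from the answer) shows that the nworkers ranges plus the tail range tile the whole dataframe:

```python
def make_chunks(n_rows, nworkers):
    # Rows per worker (floor division); the remainder goes into a tail chunk.
    size = n_rows // nworkers
    chunks = [(size * i, size * (i + 1)) for i in range(nworkers)]
    chunks.append((size * nworkers, n_rows))  # leftover rows (may be empty)
    return chunks

chunks = make_chunks(1_000_003, 4)
# Half-open [start, stop) ranges cover every row with no gaps or overlaps.
assert chunks[0][0] == 0
assert chunks[-1][1] == 1_000_003
assert all(a[1] == b[0] for a, b in zip(chunks, chunks[1:]))
```

When the row count divides evenly, the tail chunk is empty; passing an empty slice to to_sql simply writes nothing, so it is harmless.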
