parallel processing in pandas python


Question


I have 5,000,000 rows in my dataframe. In my code, I am using iterrows(), which is taking too much time. To get the required output, I have to iterate through all the rows. So I wanted to know whether I can parallelize the code in pandas.

Answer


Here's a webpage I found that might help: http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html


And here's the code for multiprocessing found on that page (with the Python 2 print statement updated for Python 3):

import pandas as pd
import multiprocessing as mp

LARGE_FILE = "D:\\my_large_file.txt"
CHUNKSIZE = 100000 # processing 100,000 rows at a time

def process_frame(df):
    # process data frame
    return len(df)

if __name__ == '__main__':
    reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4) # use 4 processes

    funclist = []
    for df in reader:
        # process each data frame
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    result = 0
    for f in funclist:
        result += f.get(timeout=10) # timeout in 10 seconds

    pool.close()
    pool.join()
    print("There are %d rows of data" % result)

