Parallel processing in pandas (Python)
Problem description
I have 5,000,000 rows in my dataframe. In my code, I am using iterrows(), which is taking too much time. To get the required output, I have to iterate through all the rows. So I wanted to know whether I can parallelize the code in pandas.
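For context, this is the kind of row-wise loop the question describes; it is slow because iterrows() builds a Series object for every row. The column names and the computation here are hypothetical stand-ins for the real work, shown alongside the vectorized form that usually removes the need for the loop entirely:

```python
import pandas as pd

# Hypothetical data standing in for the real 5,000,000-row frame
df = pd.DataFrame({"a": range(5), "b": range(5, 10)})

# Slow pattern: per-row Python loop via iterrows()
total = 0
for _, row in df.iterrows():
    total += row["a"] * row["b"]

# Equivalent vectorized form, typically orders of magnitude faster
total_vec = (df["a"] * df["b"]).sum()
```

If the per-row work can be vectorized like this, it is usually worth trying before reaching for multiprocessing.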
Recommended answer
Here's a webpage I found that might help: http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html
And here's the code for multiprocessing found on that page:
import pandas as pd
import multiprocessing as mp

LARGE_FILE = "D:\\my_large_file.txt"
CHUNKSIZE = 100000  # process 100,000 rows at a time

def process_frame(df):
    # Process a single chunk; here it just counts the rows.
    return len(df)

if __name__ == '__main__':
    # pd.read_table was removed in pandas 2.0; read_csv with sep="\t" is equivalent.
    reader = pd.read_csv(LARGE_FILE, sep="\t", chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 worker processes
    funclist = []
    for df in reader:
        # submit each chunk to the pool asynchronously
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)
    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # wait at most 10 seconds per chunk
    pool.close()
    pool.join()
    print("There are %d rows of data" % result)