Python 3 Multiprocessing Pool is slow with large variables
Question
I'm running into a very peculiar issue with multiprocessing pools in Python 3. See the code below:
```python
import multiprocessing as MP

class c(object):
    def __init__(self):
        self.foo = ""

    def a(self, b):
        return b

    def main(self):
        with open("/path/to/2million/lines/file", "r") as f:
            self.foo = f.readlines()

o = c()
o.main()

p = MP.Pool(5)
for r in p.imap(o.a, range(1, 10)):
    print(r)
```
If I execute this code as is, this is my extremely slow result:
```
1
2
3
4
5
6
7
8
9

real    0m6.641s
user    0m7.256s
sys     0m1.824s
```
However, if I remove the line `o.main()`, then I get a much faster execution time:
```
1
2
3
4
5
6
7
8
9

real    0m0.155s
user    0m0.048s
sys     0m0.004s
```
My environment has plenty of power, and I've made sure I'm not running into any memory limits. I also tested it with a smaller file, and execution time is much more acceptable. Any insight?
Update: I removed the disk IO part and just created a list instead, which shows that disk IO has nothing to do with the problem:
```python
for i in range(1, 500000):
    self.foo.append("foobar%d\n" % i)
```

```
real    0m1.763s
user    0m1.944s
sys     0m0.452s
```

```python
for i in range(1, 1000000):
    self.foo.append("foobar%d\n" % i)
```

```
real    0m3.808s
user    0m4.064s
sys     0m1.016s
```
Answer
Under the hood, `multiprocessing.Pool` uses a `Pipe` to transfer data from the parent process to the Pool workers.
This adds a hidden cost to the scheduling of tasks: because `o.a` is a bound method, the entire `o` object gets serialized with pickle and transferred via an OS pipe.
This happens for each and every task you schedule (9 times in your example, once per element of `range(1, 10)`). If your file is 10 MB in size, you end up shifting roughly 90 MB of data.
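The serialization cost is easy to observe directly with `pickle`. The sketch below mirrors the class from the question, but replaces the file read with a generated list (class name `C` and the constructor parameter are adapted for the demo):

```python
import pickle

class C(object):
    def __init__(self, n):
        # Stand-in for self.foo = f.readlines() on a large file
        self.foo = ["foobar%d\n" % i for i in range(n)]

    def a(self, b):
        return b

# Pickling the bound method o.a drags the whole instance, including
# self.foo, along with it -- and the Pool does this once per task.
small = len(pickle.dumps(C(0).a))
large = len(pickle.dumps(C(1000000).a))
print(small, large)
```

With an empty `foo` the payload is a few hundred bytes; with a million lines it is many megabytes, re-sent for every task scheduled.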
From the multiprocessing programming guidelines:

> As far as possible one should try to avoid shifting large amounts of data between processes.
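In this particular example the worker never touches `self.foo`, so one way to follow that guideline is to pass a plain module-level function to `imap` instead of the bound method `o.a`. The function is then pickled by name only, and each task transfers just its small argument (a sketch, not the original code):

```python
import multiprocessing as MP

def a(b):
    # Module-level function: pickling it sends only a reference by name,
    # so each task ships just the small integer argument.
    return b

if __name__ == "__main__":
    with MP.Pool(5) as p:
        print(list(p.imap(a, range(1, 10))))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

This runs in milliseconds regardless of how much data the parent process holds elsewhere.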
If the workers do need the file contents, a simple way to speed up your logic is to count the lines in the file, split them into equal chunks, send only the line indexes to the worker processes, and let each worker `open` the file, `seek` to the right lines, and process the data itself.