Python 3 Multiprocessing Pool is slow with large variables

Problem description

I'm running into a very peculiar issue with using multiprocessing pools in Python 3... See the code below:

import multiprocessing as MP

class c(object):
    def __init__(self):
        self.foo = ""

    def a(self, b):
        # Trivial task: just return the argument.
        return b

    def main(self):
        # Load a file with ~2 million lines into the instance.
        with open("/path/to/2million/lines/file", "r") as f:
            self.foo = f.readlines()

o = c()
o.main()
p = MP.Pool(5)
# o.a is a bound method of the instance o.
for r in p.imap(o.a, range(1, 10)):
    print(r)

If I execute this code as is, this is my extremely slow result:

1
2
3
4
5
6
7
8
9

real    0m6.641s
user    0m7.256s
sys     0m1.824s                    

However, if I remove the line o.main(), I get a much faster execution time:

1
2
3
4
5
6
7
8
9

real    0m0.155s
user    0m0.048s
sys     0m0.004s

My environment has plenty of power, and I've made sure I'm not running into any memory limits. I also tested it with a smaller file, and execution time is much more acceptable. Any insight?

I removed the disk I/O part and just created the list in memory instead. The timings below show that disk I/O has nothing to do with the problem:

# in main(), replacing self.foo = f.readlines()
for i in range(1, 500000):
    self.foo.append("foobar%d\n" % i)

real    0m1.763s
user    0m1.944s
sys     0m0.452s

for i in range(1, 1000000):
    self.foo.append("foobar%d\n" % i)

real    0m3.808s
user    0m4.064s
sys     0m1.016s

Recommended answer

Under the hood, multiprocessing.Pool uses a Pipe to transfer the data from the parent process to the Pool workers.

This adds a hidden cost to the scheduling of tasks, as the entire o object gets serialised via pickle and transferred over an OS pipe.

This is done for each and every task you are scheduling (9 times in your example, once per element of range(1, 10)). If your file is 10 MB in size, you are shifting roughly 90 MB of data.
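You can see the size of that per-task payload directly. The following is a minimal sketch, not part of the original answer; it assumes the class c from the question is defined. Pickling the bound method o.a serialises the whole instance, including the large list in self.foo:

import pickle

o = c()
o.main()  # fills o.foo with the large list

# Pickling the bound method drags the entire instance along,
# so the payload grows with the size of o.foo.
payload = pickle.dumps(o.a)
print("pickled task size: %d bytes" % len(payload))

This is exactly what multiprocessing does once per scheduled task.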

According to the multiprocessing programming guidelines:

As far as possible one should try to avoid shifting large amounts of data between processes.

A simple way to speed up your logic would be to count the lines in the file, split them into equal chunks, and send only the line indexes to the worker processes, letting each worker open the file, seek to its lines, and process the data, as sketched below.
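Here is a minimal sketch of that approach, under a few assumptions: FILENAME reuses the placeholder path from the question, the per-line handling is a stand-in for your real logic, and the worker is a module-level function so that only a small (start, stop) tuple is pickled per task. Since lines have variable length, the sketch skips to the right lines by enumerating them rather than by a byte-offset seek:

import multiprocessing as MP

FILENAME = "/path/to/2million/lines/file"  # placeholder path from the question

def count_lines(path):
    with open(path) as f:
        return sum(1 for _ in f)

def work(bounds):
    # Each worker opens the file itself; only the small
    # (start, stop) tuple crosses the pipe, not the file contents.
    start, stop = bounds
    handled = 0
    with open(FILENAME) as f:
        for index, line in enumerate(f):
            if index >= stop:
                break
            if index >= start:
                handled += 1  # placeholder: process the line here
    return handled

if __name__ == "__main__":
    total = count_lines(FILENAME)
    workers = 5
    chunk = -(-total // workers)  # ceiling division
    bounds = [(i, min(i + chunk, total)) for i in range(0, total, chunk)]
    with MP.Pool(workers) as pool:
        for r in pool.imap(work, bounds):
            print(r)

Each task now costs two integers on the pipe instead of the whole file, so the scheduling overhead stays constant no matter how large the data loaded into self.foo would have been.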
