Share Python objects between multiple processes in Python 3
Problem description
Here I create a producer-consumer program: the parent process (producer) creates many child processes (consumers), then the parent process reads a file and passes the data to the child processes.
But here comes a performance problem: passing messages between processes costs too much time (I think).
For example, with 200 MB of original data, the parent process takes less than 8 seconds to read and preprocess it, but just passing the data to the child processes through multiprocessing.Pipe costs another 8 seconds, while the child processes need only 3 to 4 seconds to do the remaining work.
So a complete workflow costs less than 18 seconds, and more than 40% of that time is spent on communication between processes. That is much more than I expected, and I tried multiprocessing.Queue and Manager; they are worse.
I work with Windows 7 / Python 3.4. I have googled for several days, and POSH looked like a good solution, but it can't be built with Python 3.4.
I have 3 questions:
1. Is there any way to directly share Python objects between processes in Python 3.4, as POSH does?
or
2. Is it possible to pass a "pointer" to an object to a child process, and have the child process recover the "pointer" into a Python object?
or
3. multiprocessing.Array may be a valid solution, but if I want to share a complex data structure such as a list, how does it work? Should I build a new class based on it and provide the interface of a list?
Edit 1:
I tried the 3rd way, but it works worse.
I defined these values:
p_pos = multiprocessing.Value('i')    # producer write position
c_pos = multiprocessing.Value('i')    # consumer read position
databuff = multiprocessing.Array('c', buff_len)  # shared buffer
and two functions:
send_data(msg)
get_data()
In the send_data function (parent process), it copies msg into databuff and sends the start and end positions (two integers) to the child process via a pipe.
Then in the get_data function (child process), it receives the two positions and copies msg out of databuff.
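The approach described can be sketched as follows. This is my reconstruction, not the original code: it assumes a single producer and consumer, a buffer large enough for one message, and that the child inherits the globals via fork (on Windows, the shared objects would have to be passed to the child explicitly). Note that the message is still copied twice, into and out of the shared buffer, which matches the disappointing result reported below.

```python
import multiprocessing

buff_len = 1 << 20  # 1 MB shared buffer (assumption)
p_pos = multiprocessing.Value('i', 0)            # producer write position
c_pos = multiprocessing.Value('i', 0)            # consumer read position
databuff = multiprocessing.Array('c', buff_len)  # shared byte buffer

def send_data(msg, conn):
    # Copy msg into the shared buffer, then tell the child where it is.
    start = p_pos.value
    end = start + len(msg)
    databuff[start:end] = msg    # first copy: into shared memory
    conn.send((start, end))      # only two integers go through the pipe
    p_pos.value = end

def get_data(conn):
    # Receive the two positions and copy the message back out.
    start, end = conn.recv()
    msg = databuff[start:end]    # second copy: out of shared memory
    c_pos.value = end
    return msg
```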
In the end, it costs twice as much as just using the pipe @_@
Edit 2:
Yes, I tried Cython, and the result looks good.
I just changed my Python script's suffix to .pyx and compiled it, and the program sped up by 15%.
No doubt, I met the "Unable to find vcvarsall.bat" and "The system cannot find the file specified" errors; I spent a whole day solving the first one and was blocked by the second.
Finally, I found Cyther, and all the troubles were gone ^_^
Recommended answer
I was in your place five months ago. I looked around a few times, but my conclusion is that multiprocessing with Python has exactly the problem you describe:
- Pipes and Queue are good, but not for big objects in my experience
- Manager() proxy objects are slow, except for arrays, and those are limited. If you want to share a complex data structure, use a Namespace like it is done here: multiprocessing in python - sharing large object (e.g. pandas dataframe) between multiple processes
- Manager() has the shared list you are looking for: https://docs.python.org/3.6/library/multiprocessing.html
- There are no pointers or real memory management in Python, so you can't share selected memory cells
I solved this kind of problem by learning C++, but it's probably not what you want to read...