Distribute many independent, expensive operations over multiple cores in python


Question


Given a large list (1,000+) of completely independent objects that each need to be manipulated through some expensive function (~5 minutes each), what is the best way to distribute the work over other cores? Theoretically, I could just cut up the list into equal parts and serialize the data with cPickle (takes a few seconds) and launch a new python processes for each chunk--and it may just come to that if I intend to use multiple computers--but this feels like more of a hack than anything. Surely there is a more integrated way to do this using a multiprocessing library? Am I over-thinking this?

Thanks.

Answer


This sounds like a good use case for a multiprocessing.Pool; depending on exactly what you're doing, it could be as simple as

import multiprocessing

pool = multiprocessing.Pool(num_procs)  # num_procs = number of worker processes
results = pool.map(the_function, list_of_objects)
pool.close()
pool.join()  # wait for all workers to finish


This will pickle each object in the list independently. If that's a problem, there are various ways to get around it (though they all have their own issues, and I don't know whether any of them work on Windows). Since your computation times are fairly long, the pickling overhead is probably irrelevant.


Since you're running this for 5 minutes x 1000 items = several days / number of cores, you probably want to save partial results along the way and print out some progress information. The easiest thing to do is probably to have the function you call save its results to a file or database or whatever; if that's not practical, you could also use apply_async in a loop and handle the results as they come in.


You could also look into something like joblib to handle this for you; I'm not very familiar with it but it seems like it's approaching the same problem.

