Python Multiprocessing with Distributed Cluster


Question

I am looking for a python package that can do multiprocessing not just across different cores within a single computer, but also with a cluster distributed across multiple machines. There are a lot of different python packages for distributed computing, but most seem to require a change in code to run (for example a prefix indicating that the object is on a remote machine). Specifically, I would like something as close as possible to the multiprocessing pool.map function. So, for example, if on a single machine the script is:

from multiprocessing import Pool
pool = Pool(processes = 8)
resultlist = pool.map(function, arglist)
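For reference, a minimal runnable version of the snippet above; the worker function and inputs are placeholders chosen here, since the original leaves `function` and `arglist` abstract:

```python
from multiprocessing import Pool

def square(x):
    # placeholder worker; stands in for "function" above
    return x * x

if __name__ == "__main__":
    # the __main__ guard matters on platforms that spawn (rather
    # than fork) worker processes, e.g. Windows and macOS
    with Pool(processes=8) as pool:
        resultlist = pool.map(square, range(10))
    print(resultlist)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```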

Then the pseudocode for a distributed cluster would be:

from distprocess import Connect, Pool, Cluster

pool1 = Pool(processes = 8)
c = Connect(ipaddress)
pool2 = c.Pool(processes = 4)
cluster = Cluster([pool1, pool2])
resultlist = cluster.map(function, arglist)

Answer

If you want a very easy solution, there isn't one.

However, there is a solution that has the multiprocessing interface -- pathos -- which has the ability to establish connections to remote servers through a parallel map, and to do multiprocessing.

If you want to have an ssh-tunneled connection, you can do that… or if you are ok with a less secure method, you can do that too.

>>> # establish a ssh tunnel
>>> from pathos.core import connect
>>> tunnel = connect('remote.computer.com', port=1234)
>>> tunnel       
Tunnel('-q -N -L55774:remote.computer.com:1234 remote.computer.com')
>>> tunnel._lport
55774
>>> tunnel._rport
1234
>>> 
>>> # define some function to run in parallel
>>> def sleepy_squared(x):
...   from time import sleep
...   sleep(1.0)
...   return x**2
... 
>>> # build a pool of servers and execute the parallel map
>>> from pathos.pp import ParallelPythonPool as Pool
>>> p = Pool(8, servers=('localhost:55774',))
>>> p.servers
('localhost:55774',)
>>> x = list(range(10))
>>> y = p.map(sleepy_squared, x)
>>> y
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Or, you could instead configure a direct connection (no ssh):

>>> p = Pool(8, servers=('remote.computer.com:5678',))
>>> # use an asynchronous parallel map
>>> res = p.amap(sleepy_squared, x)
>>> res.get()
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
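For comparison, `amap` behaves much like the standard library's `Pool.map_async`: the call returns immediately with a result handle whose `get()` blocks until the computation finishes. A local stdlib sketch of the same pattern:

```python
from multiprocessing import Pool

def square(x):
    # placeholder worker, analogous to sleepy_squared without the sleep
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # map_async returns a handle immediately instead of blocking
        res = pool.map_async(square, range(10))
        # ... other work could happen here while the pool computes ...
        print(res.get())  # blocks: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```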

It's all a bit finicky: for the remote server to work, you have to start a server running on remote.computer.com at the specified port beforehand, and you have to make sure that the settings on both your localhost and the remote host allow either the direct connection or the ssh-tunneled connection. Plus, you need the same version of pathos and of the pathos fork of pp running on each host. Also, for ssh, you need ssh-agent running to allow password-less login with ssh.
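As a sketch of that server-side step (the exact script name and flags are an assumption here, taken from classic pp; verify against your installed pathos fork of pp):

```shell
# On remote.computer.com: start a pp server listening on port 1234.
# Script name and -p flag are from classic pp; check ppserver.py --help
# for the options your installed version actually supports.
ppserver.py -p 1234
```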

But then, hopefully it all works… if your function code can be transported over to the remote host with dill.source.importable.

FYI, pathos is long overdue for a release; basically, there are a few bugs and interface changes that need to be resolved before a new stable release is cut.
