ipython with MPI clustering using machinefile


Problem description



I have successfully configured mpi with mpi4py support across three nodes, as per testing of the helloworld.py script in the mpi4py demo directory:

gms@host:~/development/mpi$ mpiexec -f machinefile -n 10 python ~/development/mpi4py/demo/helloworld.py

Hello, World! I am process 3 of 10 on host.
Hello, World! I am process 1 of 10 on worker1.
Hello, World! I am process 6 of 10 on host.
Hello, World! I am process 2 of 10 on worker2.
Hello, World! I am process 4 of 10 on worker1.
Hello, World! I am process 9 of 10 on host.
Hello, World! I am process 5 of 10 on worker2.
Hello, World! I am process 7 of 10 on worker1.
Hello, World! I am process 8 of 10 on worker2.
Hello, World! I am process 0 of 10 on host.

I am now trying to get this working in ipython and have added my machinefile to my $IPYTHON_DIR/profile_mpi/ipcluster_config.py file, as follows:

c.MPILauncher.mpi_args = ["-machinefile", "/home/gms/development/mpi/machinefile"]

I then start iPython notebook on my head node using the command: ipython notebook --profile=mpi --ip=* --port=9999 --no-browser &

and, voila, I can access it just fine from another device on my local network. However, when I run helloworld.py from iPython notebook, I only get a response from the head node: Hello, World! I am process 0 of 10 on host.

I started mpi from iPython with 10 engines, but...
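As a sanity check, the controller itself can be asked how many engines actually registered. The snippet below is only a sketch, assuming the IPython.parallel client API bundled with this IPython version (the same interface later moved to the ipyparallel package) and the 'mpi' profile used above:

from IPython.parallel import Client
import socket

rc = Client(profile='mpi')     # reads ipcontroller-client.json from profile_mpi/security
print(rc.ids)                  # one id per registered engine; 10 were requested here
view = rc[:]                   # DirectView over every registered engine
print(view.apply_sync(socket.gethostname))   # which hosts the engines are really running on

If this lists only a single engine id, the MPI launcher is running but the remaining engines never register with the controller.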

I further configured these parameters, just in case

in $IPYTHON_DIR/profile_mpi/ipcluster_config.py

c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'

in $IPYTHON_DIR/profile_mpi/ipengine_config.py

c.MPI.use = 'mpi4py'

in $IPYTHON_DIR/profile_mpi/ipcontroller_config.py

c.HubFactory.ip = '*'

However, these did not help, either.

What am I missing to get this working correctly?

EDIT UPDATE 1

I now have NFS mounted directories on my worker nodes, and thus, am fulfilling the requirement "Currently ipcluster requires that the IPYTHONDIR/profile_/security directory live on a shared filesystem that is seen by both the controller and engines." to be able to use ipcluster to start my controller and engines, using the command ipcluster start --profile=mpi -n 6 &.

So, I issue this on my head node, and then get:

2016-03-04 20:31:26.280 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:26.283 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:26.284 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:27.282 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
2016-03-04 20:31:57.301 [IPClusterStart] Engines appear to have started successfully

Then, proceed to issue the same command to start the engines on the other nodes, but I get:

2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.095 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:33.100 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.111 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:34.098 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
[1]+  Stopped                 ipcluster start --profile=mpi -n 6

with no confirmation that the Engines appear to have started successfully ...

Even more confusing, when I do a ps au on the worker nodes, I get:

gms       3862  0.1  2.5  38684 23740 pts/0    T    20:31   0:01 /usr/bin/python /usr/bin/ipcluster start --profile=mpi -n 6
gms       3874  0.1  1.7  21428 16772 pts/0    T    20:31   0:01 /usr/bin/python -c from IPython.parallel.apps.ipcontrollerapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.co
gms       3875  0.0  0.2   4768  2288 pts/0    T    20:31   0:00 mpiexec -n 6 -machinefile /home/gms/development/mpi/machinefile /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new
gms       3876  0.0  0.4   5732  4132 pts/0    T    20:31   0:00 /usr/bin/ssh -x 192.168.1.1 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 -
gms       3877  0.0  0.1   4816  1204 pts/0    T    20:31   0:00 /usr/bin/hydra_pmi_proxy --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --proxy-id 1
gms       3878  0.0  0.4   5732  4028 pts/0    T    20:31   0:00 /usr/bin/ssh -x 192.168.1.201 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0
gms       3879  0.0  0.6   8944  6008 pts/0    T    20:31   0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config
gms       3880  0.0  0.6   8944  6108 pts/0    T    20:31   0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config

Where the ip addresses in processes 3876 and 3878 are from the other hosts in the cluster. But...

When I run a similar test directly using ipython, all I get is a response from the localhost (even though, minus ipython, this works directly with mpi and mpi4py, as noted in my original post):

gms@head:~/development/mpi$ ipython test.py
head[3834]: 0/1

gms@head:~/development/mpi$ mpiexec -f machinefile -n 10 ipython test.py
worker1[3961]: 4/10
worker1[3962]: 7/10
head[3946]: 6/10
head[3944]: 0/10
worker2[4054]: 5/10
worker2[4055]: 8/10
head[3947]: 9/10
worker1[3960]: 1/10
worker2[4053]: 2/10
head[3945]: 3/10

I still seem to be missing something obvious, although I am convinced my configuration is now correct. One thing that pops out is that when I start ipcluster on my worker nodes, I get this: 2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid

EDIT UPDATE 2

This is more to document what is happening and, hopefully, ultimately what gets this working:

I cleaned out my log files and reissued ipcluster start --profile=mpi -n 6 &

And now I see 6 log files for my engines, and 1 for my controller:

drwxr-xr-x 2 gms gms 12288 Mar  6 03:28 .
drwxr-xr-x 7 gms gms  4096 Mar  6 03:31 ..
-rw-r--r-- 1 gms gms  1313 Mar  6 03:28 ipcontroller-15664.log
-rw-r--r-- 1 gms gms   598 Mar  6 03:28 ipengine-15669.log
-rw-r--r-- 1 gms gms   598 Mar  6 03:28 ipengine-15670.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4405.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4406.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4628.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4629.log 

Looking in the log for ipcontroller, it looks like only one engine registered:

2016-03-06 03:28:12.469 [IPControllerApp] Hub listening on tcp://*:34540 for registration.
2016-03-06 03:28:12.480 [IPControllerApp] Hub using DB backend: 'NoDB'
2016-03-06 03:28:12.749 [IPControllerApp] hub::created hub
2016-03-06 03:28:12.751 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-client.json
2016-03-06 03:28:12.754 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json
2016-03-06 03:28:12.758 [IPControllerApp] task::using Python leastload Task scheduler
2016-03-06 03:28:12.760 [IPControllerApp] Heartmonitor started
2016-03-06 03:28:12.808 [IPControllerApp] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcontroller.pid
2016-03-06 03:28:14.792 [IPControllerApp] client::client 'a8441250-d3d7-4a0b-8210-dae327665450' requested 'registration_request'
2016-03-06 03:28:14.800 [IPControllerApp] client::client '12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295' requested 'registration_request'
2016-03-06 03:28:18.764 [IPControllerApp] registration::finished registering engine 1:'12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295'
2016-03-06 03:28:18.768 [IPControllerApp] engine::Engine Connected: 1
2016-03-06 03:28:20.800 [IPControllerApp] registration::purging stalled registration: 0

Shouldn't each of the 6 engines be registered?

2 of the engines' logs look like they registered fine:

2016-03-06 03:28:13.746 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:13.746 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()

2016-03-06 03:28:14.735 [IPEngineApp] Loading url_file     u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.780 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:15.282 [IPEngineApp] Using existing profile dir:    
u'/home/gms/.config/ipython/profile_mpi'
2016-03-06 03:28:15.286 [IPEngineApp] Completed registration with id 1

while the other registered with id 0

But, the other 4 engines gave a time out error:

2016-03-06 03:28:14.676 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:14.689 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()

2016-03-06 03:28:14.733 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.805 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:16.807 [IPEngineApp] Registration timed out after 2.0 seconds
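One detail worth noting from these logs: every engine registers against the address recorded in ipcontroller-engine.json, and here that is tcp://127.0.0.1:34540, which points at localhost on whichever machine the engine runs on, so a remote worker cannot reach the controller through it; the two engines that register successfully are presumably the local ones. If the delay rather than the address were the problem, the registration timeout also appears to be configurable from the engine config; the exact trait name below is an assumption and worth checking against the generated ipengine_config.py:

# in $IPYTHON_DIR/profile_mpi/ipengine_config.py
# assumption: the registration timeout is exposed as EngineFactory.timeout (seconds)
c.EngineFactory.timeout = 10.0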

Hmmm... I think I may try a reinstall of ipython tomorrow.

EDIT UPDATE 3

Conflicting versions of ipython were installed (looks like through apt-get and pip). Uninstalled and reinstalled using pip install ipython[all]...

EDIT UPDATE 4

I hope someone is finding this useful AND I hope someone can weigh in at some point to help clarify a few things.

Anywho, I installed a virtualenv to isolate my environment, and it looks like it had some degree of success, I think. I fired up 'ipcluster start -n 4 --profile=mpi' on each of my nodes, then ssh'ed back into my head node and ran a test script, which first calls ipcluster. Judging from the output, it is doing some parallel computing.

However, when I run my test script that queries all the nodes, I just get a response from the head node.

But, again, if I just run the straight up mpiexec command, everything is hunky dory.

To add to the confusion, if I look at the processes on the nodes, I see all sorts of behavior indicating they are working together.

And nothing out of the ordinary in my logs. Why am I not getting the worker nodes returned in my second test script (code included below)?

# test_mpi.py
import os
import socket
from mpi4py import MPI

MPI = MPI.COMM_WORLD

print("{host}[{pid}]: {rank}/{size}".format(
    host=socket.gethostname(),
    pid=os.getpid(),
    rank=MPI.rank,
    size=MPI.size,
))
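Worth noting: run as a plain script, test_mpi.py only reports the single process it runs in, so launching it on the head node without mpiexec will always print one rank. To involve the registered engines, the work has to be shipped through the parallel client. The sketch below is an assumption about how that might look with the IPython.parallel DirectView, assuming the engines were launched with MPIEngineSetLauncher so that COMM_WORLD spans the engine set:

from IPython.parallel import Client

def report():
    # imports happen on the engine so the function is self-contained when shipped
    import os
    import socket
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    return "{0}[{1}]: {2}/{3}".format(socket.gethostname(), os.getpid(), comm.rank, comm.size)

rc = Client(profile='mpi')
view = rc[:]                        # DirectView over all registered engines
for line in view.apply_sync(report):
    print(line)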

Solution

Not sure why, but I recreated my ipcluster_config.py file and again added c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"] to it and this time it worked - for some bizarre reason. I could swear I had this in it before, but alas...
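For completeness, a sketch of the relevant settings in the recreated file; the machinefile path is the one from the question and the launcher line is the one added in the earlier edit, so treat this as a summary rather than a verified working config:

# $IPYTHON_DIR/profile_mpi/ipcluster_config.py
c.MPILauncher.mpi_args = ["-machinefile", "/home/gms/development/mpi/machinefile"]
c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'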
