ipython with MPI clustering using machinefile


Problem description



I have successfully configured mpi with mpi4py support across three nodes, as per testing of the helloworld.py script in the mpi4py demo directory:

gms@host:~/development/mpi$ mpiexec -f machinefile -n 10 python ~/development/mpi4py/demo/helloworld.py

Hello, World! I am process 3 of 10 on host.
Hello, World! I am process 1 of 10 on worker1.
Hello, World! I am process 6 of 10 on host.
Hello, World! I am process 2 of 10 on worker2.
Hello, World! I am process 4 of 10 on worker1.
Hello, World! I am process 9 of 10 on host.
Hello, World! I am process 5 of 10 on worker2.
Hello, World! I am process 7 of 10 on worker1.
Hello, World! I am process 8 of 10 on worker2.
Hello, World! I am process 0 of 10 on host.

I am now trying to get this working in ipython and have added my machinefile to my $IPYTHON_DIR/profile_mpi/ipcluster_config.py file, as follows:

c.MPILauncher.mpi_args = ["-machinefile", "/home/gms/development/mpi/machinefile"]

I then start iPython notebook on my head node using the command: ipython notebook --profile=mpi --ip=* --port=9999 --no-browser &

and, voila, I can access it just fine from another device on my local network. However, when I run helloworld.py from iPython notebook, I only get a response from the head node: Hello, World! I am process 0 of 10 on host.

I started mpi from iPython with 10 engines, but...
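As a sanity check, the controller itself can be asked how many engines actually registered. The snippet below is only a sketch, assuming the IPython.parallel client API bundled with this IPython version (the same interface later moved to the ipyparallel package) and the 'mpi' profile used above:

from IPython.parallel import Client
import socket

rc = Client(profile='mpi')     # reads ipcontroller-client.json from profile_mpi/security
print(rc.ids)                  # one id per registered engine; 10 were requested here
view = rc[:]                   # DirectView over every registered engine
print(view.apply_sync(socket.gethostname))   # which hosts the engines are really running on

If this lists only a single engine id, the MPI launcher is running but the remaining engines never register with the controller.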

I further configured these parameters, just in case

in $IPYTHON_DIR/profile_mpi/ipcluster_config.py

c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'

in $IPYTHON_DIR/profile_mpi/ipengine_config.py

c.MPI.use = 'mpi4py'

in $IPYTHON_DIR/profile_mpi/ipcontroller_config.py

c.HubFactory.ip = '*'

However, these did not help, either.

What am I missing to get this working correctly?

EDIT UPDATE 1

I now have NFS mounted directories on my worker nodes, and thus, am fulfilling the requirement "Currently ipcluster requires that the IPYTHONDIR/profile_/security directory live on a shared filesystem that is seen by both the controller and engines." to be able to use ipcluster to start my controller and engines, using the command ipcluster start --profile=mpi -n 6 &.

So, I issue this on my head node, and then get:

2016-03-04 20:31:26.280 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:26.283 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:26.284 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:27.282 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
2016-03-04 20:31:57.301 [IPClusterStart] Engines appear to have started successfully

Then, proceed to issue the same command to start the engines on the other nodes, but I get:

2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.095 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:33.100 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.111 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:34.098 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
[1]+  Stopped                 ipcluster start --profile=mpi -n 6

with no confirmation that the Engines appear to have started successfully ...

Even more confusing, when I do a ps au on the worker nodes, I get:

gms       3862  0.1  2.5  38684 23740 pts/0    T    20:31   0:01 /usr/bin/python /usr/bin/ipcluster start --profile=mpi -n 6
gms       3874  0.1  1.7  21428 16772 pts/0    T    20:31   0:01 /usr/bin/python -c from IPython.parallel.apps.ipcontrollerapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.co
gms       3875  0.0  0.2   4768  2288 pts/0    T    20:31   0:00 mpiexec -n 6 -machinefile /home/gms/development/mpi/machinefile /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new
gms       3876  0.0  0.4   5732  4132 pts/0    T    20:31   0:00 /usr/bin/ssh -x 192.168.1.1 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 -
gms       3877  0.0  0.1   4816  1204 pts/0    T    20:31   0:00 /usr/bin/hydra_pmi_proxy --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --proxy-id 1
gms       3878  0.0  0.4   5732  4028 pts/0    T    20:31   0:00 /usr/bin/ssh -x 192.168.1.201 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0
gms       3879  0.0  0.6   8944  6008 pts/0    T    20:31   0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config
gms       3880  0.0  0.6   8944  6108 pts/0    T    20:31   0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config

Where the ip addresses in processes 3876 and 3878 are from the other hosts in the cluster. But...

When I run a similar test directly using ipython, all I get is a response from the localhost (even though, minus ipython, this works directly with mpi and mpi4py, as noted in my original post):

gms@head:~/development/mpi$ ipython test.py
head[3834]: 0/1

gms@head:~/development/mpi$ mpiexec -f machinefile -n 10 ipython test.py
worker1[3961]: 4/10
worker1[3962]: 7/10
head[3946]: 6/10
head[3944]: 0/10
worker2[4054]: 5/10
worker2[4055]: 8/10
head[3947]: 9/10
worker1[3960]: 1/10
worker2[4053]: 2/10
head[3945]: 3/10

I still seem to be missing something obvious, although I am convinced my configuration is now correct. One thing that pops out is that when I start ipcluster on my worker nodes, I get this: 2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid

EDIT UPDATE 2

This is more to document what is happening and, hopefully, ultimately what gets this working:

I cleaned out my log files and reissued ipcluster start --profile=mpi -n 6 &

And now I see 6 log files for my engines, and 1 for my controller:

drwxr-xr-x 2 gms gms 12288 Mar  6 03:28 .
drwxr-xr-x 7 gms gms  4096 Mar  6 03:31 ..
-rw-r--r-- 1 gms gms  1313 Mar  6 03:28 ipcontroller-15664.log
-rw-r--r-- 1 gms gms   598 Mar  6 03:28 ipengine-15669.log
-rw-r--r-- 1 gms gms   598 Mar  6 03:28 ipengine-15670.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4405.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4406.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4628.log
-rw-r--r-- 1 gms gms   499 Mar  6 03:28 ipengine-4629.log 

Looking in the log for ipcontroller, it looks like only one engine registered:

2016-03-06 03:28:12.469 [IPControllerApp] Hub listening on tcp://*:34540 for registration.
2016-03-06 03:28:12.480 [IPControllerApp] Hub using DB backend: 'NoDB'
2016-03-06 03:28:12.749 [IPControllerApp] hub::created hub
2016-03-06 03:28:12.751 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-client.json
2016-03-06 03:28:12.754 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json
2016-03-06 03:28:12.758 [IPControllerApp] task::using Python leastload Task scheduler
2016-03-06 03:28:12.760 [IPControllerApp] Heartmonitor started
2016-03-06 03:28:12.808 [IPControllerApp] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcontroller.pid
2016-03-06 03:28:14.792 [IPControllerApp] client::client 'a8441250-d3d7-4a0b-8210-dae327665450' requested 'registration_request'
2016-03-06 03:28:14.800 [IPControllerApp] client::client '12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295' requested 'registration_request'
2016-03-06 03:28:18.764 [IPControllerApp] registration::finished registering engine 1:'12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295'
2016-03-06 03:28:18.768 [IPControllerApp] engine::Engine Connected: 1
2016-03-06 03:28:20.800 [IPControllerApp] registration::purging stalled registration: 0

Shouldn't each of the 6 engines be registered?

2 of the engines' logs look like they registered fine:

2016-03-06 03:28:13.746 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:13.746 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()

2016-03-06 03:28:14.735 [IPEngineApp] Loading url_file     u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.780 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:15.282 [IPEngineApp] Using existing profile dir:    
u'/home/gms/.config/ipython/profile_mpi'
2016-03-06 03:28:15.286 [IPEngineApp] Completed registration with id 1

while the other registered with id 0

But, the other 4 engines gave a time out error:

2016-03-06 03:28:14.676 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:14.689 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()

2016-03-06 03:28:14.733 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.805 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:16.807 [IPEngineApp] Registration timed out after 2.0 seconds
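One detail worth noting from these logs: every engine registers against the address recorded in ipcontroller-engine.json, and here that is tcp://127.0.0.1:34540, which points at localhost on whichever machine the engine runs on, so a remote worker cannot reach the controller through it; the two engines that register successfully are presumably the local ones. If the delay rather than the address were the problem, the registration timeout also appears to be configurable from the engine config; the exact trait name below is an assumption and worth checking against the generated ipengine_config.py:

# in $IPYTHON_DIR/profile_mpi/ipengine_config.py
# assumption: the registration timeout is exposed as EngineFactory.timeout (seconds)
c.EngineFactory.timeout = 10.0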

Hmmm... I think I may try a reinstall of ipython tomorrow.

EDIT UPDATE 3

Conflicting versions of ipython were installed (looks like through apt-get and pip). Uninstalled and reinstalled using pip install ipython[all]...

EDIT UPDATE 4

I hope someone is finding this useful AND I hope someone can weigh in at some point to help clarify a few things.

Anywho, I installed a virtualenv to isolate my environment, and it looks like it had some degree of success, I think. I fired up 'ipcluster start -n 4 --profile=mpi' on each of my nodes, then ssh'ed back into my head node and ran a test script, which first calls ipcluster. Judging from the output, it is doing some parallel computing.

However, when I run my test script that queries all the nodes, I just get a response from the head node.

But, again, if I just run the straight up mpiexec command, everything is hunky dory.

To add to the confusion, if I look at the processes on the nodes, I see all sorts of behavior indicating they are working together.

And nothing out of the ordinary in my logs. Why am I not getting the worker nodes returned in my second test script (code included below)?

# test_mpi.py
import os
import socket
from mpi4py import MPI

MPI = MPI.COMM_WORLD

print("{host}[{pid}]: {rank}/{size}".format(
    host=socket.gethostname(),
    pid=os.getpid(),
    rank=MPI.rank,
    size=MPI.size,
))
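Worth noting: run as a plain script, test_mpi.py only reports the single process it runs in, so launching it on the head node without mpiexec will always print one rank. To involve the registered engines, the work has to be shipped through the parallel client. The sketch below is an assumption about how that might look with the IPython.parallel DirectView, assuming the engines were launched with MPIEngineSetLauncher so that COMM_WORLD spans the engine set:

from IPython.parallel import Client

def report():
    # imports happen on the engine so the function is self-contained when shipped
    import os
    import socket
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    return "{0}[{1}]: {2}/{3}".format(socket.gethostname(), os.getpid(), comm.rank, comm.size)

rc = Client(profile='mpi')
view = rc[:]                        # DirectView over all registered engines
for line in view.apply_sync(report):
    print(line)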

Solution

Not sure why, but I recreated my ipcluster_config.py file and again added c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"] to it and this time it worked - for some bizarre reason. I could swear I had this in it before, but alas...
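For completeness, a sketch of the relevant settings in the recreated file; the machinefile path is the one from the question and the launcher line is the one added in the earlier edit, so treat this as a summary rather than a verified working config:

# $IPYTHON_DIR/profile_mpi/ipcluster_config.py
c.MPILauncher.mpi_args = ["-machinefile", "/home/gms/development/mpi/machinefile"]
c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'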
