MPICH communication failed


Problem description

I have a simple MPICH program in which processes send and receive messages from each other in ring order.

I've set up two identical virtual machines and made sure the network is working. A simple MPICH program runs fine on each machine individually. The problem arises when processes on different machines try to communicate, as in the ring program. I get the following error:


Fatal error in MPI_Send: A process has failed, error stack:
MPI_Send(171)...............: MPI_Send(buf=0xbfed8c08, count=1, MPI_INT, dest=1,
tag=1, MPI_COMM_WORLD) failed
MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection refused




    • SSH is passwordless and works in both directions.
    • /etc/hosts is configured properly.
    • The firewall is disabled on both machines.
    • An NFS client/server is configured, with a directory shared between the machines (according to a linked guide).
    • Tried both MPICH and OpenMPI with Hydra.
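
A hosts file can look correct and still lead to "Connection refused" if a node's own hostname is mapped to a loopback address (some Debian-style installs add a 127.0.1.1 entry), because the peer is then told to connect to itself. Whether that is the cause here is not confirmed by the post; the following is only a minimal sketch of how to check for it. The function name loopback_names and the sample addresses are hypothetical.

```shell
#!/bin/sh
# loopback_names FILE
# Print every hostname in a hosts-format FILE that is mapped to 127.x.x.x,
# ignoring the usual localhost aliases. (Hypothetical helper for illustration.)
loopback_names() {
    awk '
        /^[ \t]*#/ { next }                 # skip comment lines
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break        # stop at a trailing comment
                if ($i !~ /^localhost/) print $i
            }
        }
    ' "$1"
}

# Demo on a sample file; the addresses below are made up for the example.
sample=$(mktemp)
cat > "$sample" <<'EOF'
127.0.0.1    localhost
127.0.1.1    node1.example.com node1
192.168.1.11 node2.example.com node2
EOF

found=$(loopback_names "$sample")
rm -f "$sample"
echo "$found"
```

On a real node, running `loopback_names /etc/hosts` should print nothing; any node name it does print is worth remapping to the machine's LAN address.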
Recommended answer

      Here is what I did, and it works!

      Installed the following packages from source (tarballs):

      hydra 
      openmpi
      

      Created a hosts file (on both nodes):

      # cat /home/spatel/mpi/hydra/hosts
      node1
      node2 
      

      Set the variable in .bashrc (on both nodes):

      echo 'export HYDRA_HOST_FILE=/home/spatel/mpi/hydra/hosts' >> ~/.bashrc
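
For mpiexec to pick up HYDRA_HOST_FILE, the variable has to be exported into the environment of child processes; a plain assignment stays local to the shell. A minimal sketch of the difference, using a hypothetical path /tmp/hosts and `sh -c` as a stand-in for mpiexec:

```shell
#!/bin/sh
# Start from a clean state so an inherited value does not mask the effect.
unset HYDRA_HOST_FILE

# Plain assignment: visible in this shell only, not in child processes.
HYDRA_HOST_FILE=/tmp/hosts        # hypothetical path for illustration
plain=$(sh -c 'echo "${HYDRA_HOST_FILE:-unset}"')

# Exported: inherited by child processes, which is what mpiexec needs.
export HYDRA_HOST_FILE
exported=$(sh -c 'echo "${HYDRA_HOST_FILE:-unset}"')

echo "plain assignment: $plain"      # prints: plain assignment: unset
echo "after export:     $exported"   # prints: after export:     /tmp/hosts
```

After editing ~/.bashrc, open a new shell or run `. ~/.bashrc` so the setting takes effect.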
      

      Ran a HelloWorld MPI program on a single node:

      node1# /home/spatel/mpi/hydra/bin/mpiexec -np 1 /home/spatel/mpi/mpi_hello_world
      Hello world from processor node1.example.com, rank 0 out of 1 processors
      

      Ran on multiple nodes using the -machinefile option (-np is the number of processes):

      node1# /home/spatel/mpi/hydra/bin/mpiexec -np 4 -machinefile /home/spatel/mpi/hydra/hosts /home/spatel/mpi/mpi_hello_world
      Hello world from processor node1.example.com, rank 0 out of 1 processors
      Hello world from processor node2.example.com, rank 0 out of 1 processors
      Hello world from processor node1.example.com, rank 0 out of 1 processors
      Hello world from processor node2.example.com, rank 0 out of 1 processors
      
