Controlling node mapping of MPI_COMM_SPAWN

Problem Description

The whole issue can be summarized as follows: I'm trying to replicate the behaviour of a call to system (or fork), but in an MPI environment. (It turns out that you can't call system in parallel.) That means I have a program running on many nodes, one process on each node, and I then want each process to call an external program (so for n nodes I'd have n copies of the external program running), wait for all those copies to finish, and then keep running the original program.

To achieve this in a way that is safe in a parallel environment, I've been using a combination of MPI_COMM_SPAWN and a blocking send. Here are some example parent and child programs for my implementation (the code is in Fortran 90, but the syntax would be similar for a C program):

parent.f90:

program parent

    include 'mpif.h'

    !usual mpi variables                                                                                                
    integer                        :: size, rank, ierr
    integer                        :: status(MPI_STATUS_SIZE)

    integer MPI_COMM_CHILD, ri
    integer tag
    character *128 message

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    write(*, *) "I am parent on rank", rank, "of", size                                                 

    call MPI_COMM_SPAWN('./child', MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0, &
        MPI_COMM_SELF, MPI_COMM_CHILD, MPI_ERRCODES_IGNORE, ierr)

    write(*, *) "Parent", MPI_COMM_SELF, "child comm", MPI_COMM_CHILD

    tag = 1
    call MPI_RECV(message, 128, MPI_CHARACTER, 0, tag, MPI_COMM_CHILD,&
                  status, ierr)
    write(*, *) "Parent", MPI_COMM_SELF, "child comm", MPI_COMM_CHILD,&
                "!!!"//trim(message)//"!!!"

    call mpi_barrier(mpi_comm_world, ierr)
    call MPI_Finalize(ierr)

end program parent

child.f90:

program child

  include 'mpif.h'

  !usual mpi variables                                                                                                
  integer                        :: size, rank, ierr, parent
  integer                        :: status(MPI_STATUS_SIZE)

  integer MPI_COMM_PARENT, psize, prank
  integer tag
  character *128 message

  call MPI_init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  call MPI_Comm_get_parent(MPI_COMM_PARENT, ierr)
  call MPI_Comm_size(MPI_COMM_PARENT, psize, ierr)
  call MPI_Comm_rank(MPI_COMM_PARENT, prank, ierr)

  write(*, *) "I am child on rank", rank, "of", size, "with comm",&
              MPI_COMM_WORLD, "and parent", MPI_COMM_PARENT,&
              psize, prank

  tag = 1
  message = 'Hello Mom and/or Dad!'
  call MPI_SEND(message, 128, MPI_CHARACTER, 0, tag, MPI_COMM_PARENT, ierr)

  call mpi_barrier(MPI_COMM_WORLD, ierr)
  call MPI_Finalize(ierr)

end program child

After compiling with ifort 16.0.3 and Open MPI 1.10.3 (Intel build), and running with (for example) mpirun -np 4 ./parent, I get the following output:

 I am parent on rank           0 of           4
 I am parent on rank           1 of           4
 I am parent on rank           2 of           4
 I am parent on rank           3 of           4
 Parent           1 child comm           3
 I am child on rank           0 of           1 with comm           0 and parent
           3           1           0
 Parent           1 child comm           3 !!!Hello Mom and/or Dad!!!!
 Parent           1 child comm           3
 I am child on rank           0 of           1 with comm           0 and parent
           3           1           0
 Parent           1 child comm           3
 I am child on rank           0 of           1 with comm           0 and parent
           3           1           0
 Parent           1 child comm           3 !!!Hello Mom and/or Dad!!!!
 Parent           1 child comm           3 !!!Hello Mom and/or Dad!!!!
 Parent           1 child comm           3
 I am child on rank           0 of           1 with comm           0 and parent
           3           1           0
 Parent           1 child comm           3 !!!Hello Mom and/or Dad!!!!

This is essentially the behaviour that I want. From what I understand, by using maxprocs=1, root=0, and MPI_COMM_SELF as the parent communicator, I'm telling each parent process to spawn one child which only knows about its parent, since the parent is root=0 (and the only process) in the MPI_COMM_SELF scope. Then I ask the parent to wait for a message from its child process. The child gets the parent's (SELF) communicator and sends its message to root=0, which can only be the parent. So this all works fine.

I was hoping that each process would spawn its child on its own node. I run with the number of MPI processes equal to the number of nodes, and when I make my call to mpirun I use the flag --map-by node to ensure one process per node. I was hoping that the child process would in some way inherit that, or else not know that any other nodes exist. But the behaviour I'm seeing is very unpredictable: some children get spread across nodes, while other nodes (notably the one hosting root=0 of the main MPI job) get many piling up on them.

Is there some way to ensure that the spawned processes are bound to the node of their parent process? Maybe through the MPI_Info argument that I can pass to MPI_COMM_SPAWN?

Recommended Answer

Each MPI job in Open MPI starts with some set of slots distributed over one or more hosts. Those slots are consumed by both the initial MPI processes and by any process spawned as part of a child MPI job. In your case, the hosts could be provided in a hostfile similar to this:

host1 slots=2 max_slots=2
host2 slots=2 max_slots=2
host3 slots=2 max_slots=2
...

slots=2 max_slots=2 restricts Open MPI to running only two processes per host.

The initial job launch should specify one process per host, otherwise Open MPI will fill up all the slots with processes from the parent job. --map-by ppr:1:node does the trick:

mpiexec --hostfile hosts --map-by ppr:1:node ./parent

Now, the problem is that Open MPI will continue filling the slots on a first come, first served basis as new child jobs are spawned, so there is no guarantee that a child process will be started on the same host as its parent. To enforce this, as advised by Gilles Gouaillardet, set the host key of the info argument to the hostname returned by MPI_Get_processor_name:

character(len=MPI_MAX_PROCESSOR_NAME) :: procn
integer :: procl
integer :: info

! name of the host this parent process is running on
call MPI_Get_processor_name(procn, procl, ierr)

! ask Open MPI to place the child on that same host
call MPI_Info_create(info, ierr)
call MPI_Info_set(info, 'host', trim(procn), ierr)

call MPI_Comm_spawn('./child', MPI_ARGV_NULL, 1, info, 0, &
...
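
For completeness, a minimal sketch of the full spawn call with the info object in place, reusing MPI_COMM_CHILD and ierr as declared in the original parent.f90 (freeing the info handle afterwards is optional but tidy):

call MPI_Comm_spawn('./child', MPI_ARGV_NULL, 1, info, 0, &
    MPI_COMM_SELF, MPI_COMM_CHILD, MPI_ERRCODES_IGNORE, ierr)

! the info handle is no longer needed once the spawn has returned
call MPI_Info_free(info, ierr)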

It is possible that your MPI jobs abort with the following message:

--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

It basically means that the requested host is either full (all slots already filled) or the host is not in the original host list and therefore no slots were allocated on it. The former is obviously not the case since the hostfile lists two slots per host and the parent job only uses one. The hostname as provided in the host key-value pair must match exactly the entry in the initial list of hosts. It is often the case that the hostfile contains only unqualified host names, like in the sample hostfile in the first paragraph, while MPI_Get_processor_name returns the FQDN if the domain part is set, e.g., node1.some.domain.local, node2.some.domain.local, etc. The solution is to use FQDNs in the hostfile:

host1.example.local slots=2 max_slots=2
host2.example.local slots=2 max_slots=2
host3.example.local slots=2 max_slots=2
...

If the allocation is instead provided by a resource manager such as SLURM, the solution is to transform the result from MPI_Get_processor_name to match what the RM provides.
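
A rough sketch of that transformation, assuming the resource manager lists hosts by their short, unqualified names, so any domain part returned by MPI_Get_processor_name is stripped at the first dot:

character(len=MPI_MAX_PROCESSOR_NAME) :: procn, shortn
integer :: procl, dotpos, info, ierr

call MPI_Get_processor_name(procn, procl, ierr)

! keep only the part before the first dot, e.g.
! node1.some.domain.local -> node1
dotpos = index(procn, '.')
if (dotpos > 0) then
    shortn = procn(1:dotpos-1)
else
    shortn = procn
end if

call MPI_Info_create(info, ierr)
call MPI_Info_set(info, 'host', trim(shortn), ierr)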

Note that the man page for MPI_Comm_spawn lists the add-host key, which is supposed to add the hostname in the value to the list of hosts for the job:

add-host               char *   Add the specified host to the list of
                                hosts known to this job and use it for
                                the associated process. This will be
                                used similarly to the -host option.
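
Using it would mirror the host example above; this is a sketch only, since (as noted below) it did not behave as documented in the versions tested:

call MPI_Info_create(info, ierr)
! 'add-host' is documented to extend the job's host list, like the -host option
call MPI_Info_set(info, 'add-host', trim(procn), ierr)
call MPI_Comm_spawn('./child', MPI_ARGV_NULL, 1, info, 0, &
    MPI_COMM_SELF, MPI_COMM_CHILD, MPI_ERRCODES_IGNORE, ierr)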

In my experience, this has never worked (tested with Open MPI up to 1.10.4).
