MPI - Add/remove node while program is running

Question

Is there an MPI implementation that allows nodes to be dynamically added or removed at runtime? Do any of them recover from the complete hardware failure of a node, so that the node can be repaired and relaunched without restarting the program?

Recommended answer

Is there an MPI implementation that allows nodes to be dynamically added/removed at runtime?

This is actually two questions.

Adding nodes at runtime is usually possible with calls such as MPI_Comm_spawn. As @Hristo pointed out in the comments, in Open MPI you have to set the correct info key for this to work. Other implementations may support it as well.

Removing nodes is a big area of research at the moment. Most MPI implementations currently have varying levels of success surviving a total node failure. In the current releases of Open MPI, I don't believe there is any support for that sort of failure [citation needed], though work to improve that is ongoing. In the current version of MPICH, you can pass the flag -disable-auto-cleanup to mpiexec so that it does not automatically clean up your application after a process/node failure. However, you still have to modify your MPI application to handle the failure. As far as I know, none of the MPICH derivatives (Intel MPI, Cray MPI, IBM MPI, MVAPICH, etc.) support this feature.

Other research implementations extend the MPI Standard's support in this area. The standardization body is currently considering User Level Failure Mitigation (ULFM) as a way of letting the user handle process failures. A research implementation based on Open MPI is available at the linked website, and an experimental prototype will also be in the next version of MPICH (3.2).
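To make the "add nodes with MPI_Comm_spawn" part concrete, here is a minimal sketch using mpi4py. It rests on several assumptions that are not in the answer: that "add-host" is the Open MPI info key being referred to (it is Open MPI-specific, and other implementations may ignore it), that the workers are Python scripts, and that the helper names `build_spawn_hints`, `spawn_on_new_host` and `join_parents` are purely illustrative.

```python
import sys


def build_spawn_hints(new_host):
    """Return the info hints used to spawn processes on a newly added host.

    Assumption: "add-host" is the Open MPI-specific info key meant by the
    answer; other MPI implementations may ignore or reject it.
    """
    if not new_host:
        raise ValueError("new_host must be a non-empty host name")
    return {"add-host": new_host}


def spawn_on_new_host(worker_script, new_host, nprocs=1):
    """Collectively spawn workers on new_host and return a merged communicator."""
    # Imported here so the module can be loaded on a machine without MPI.
    from mpi4py import MPI

    info = MPI.Info.Create()
    for key, value in build_spawn_hints(new_host).items():
        info.Set(key, value)

    # Spawn is collective over COMM_WORLD; only the root's command and
    # info arguments are significant.
    intercomm = MPI.COMM_WORLD.Spawn(
        sys.executable, args=[worker_script], maxprocs=nprocs, info=info, root=0
    )
    info.Free()

    # Merge parents (low ranks) and children (high ranks) into a single
    # intracommunicator that the application can use from now on.
    return intercomm.Merge(high=False)


def join_parents():
    """Called by a spawned worker: connect back to the processes that spawned it."""
    from mpi4py import MPI

    parent = MPI.Comm.Get_parent()
    if parent == MPI.COMM_NULL:
        raise RuntimeError("this process was not started by MPI_Comm_spawn")
    return parent.Merge(high=True)
```

Every existing rank would call `spawn_on_new_host("worker.py", "node05")` at the same point in the program (the script and host names are placeholders), and the worker script would call `join_parents()` on startup. Whether the new host is actually accepted depends on the launcher and resource manager in use.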

Do any recover from complete hardware failure of a node, allowing the node to be repaired and relaunched without restarting the program?

This is essentially the same process as above. You would need to use the APIs to remove the failed process, then somehow find out that the node is available again, and add it back using spawn. These calls have to be made from inside the application, though, not externally.
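The remove-then-respawn sequence described above might look roughly like the following sketch. It is hedged heavily: it assumes mpi4py 4.0 or later built against a ULFM-capable MPI library, where `Revoke` and `Shrink` wrap the ULFM calls MPIX_Comm_revoke and MPIX_Comm_shrink and where `MPI.ERR_PROC_FAILED` and `MPI.ERR_REVOKED` are defined. The function names and the `target_size` parameter are illustrative, and detecting that a repaired node is available again is left to the application.

```python
import sys


def workers_to_respawn(target_size, survivor_count):
    """Number of replacement processes needed to get back to target_size."""
    return max(0, target_size - survivor_count)


def recover(comm, worker_script, target_size):
    """Drop failed ranks from comm, then spawn replacements and merge them in.

    Assumption: comm comes from a ULFM-capable MPI, so Revoke and Shrink
    are available on it.
    """
    comm.Revoke()              # interrupt pending communication on all survivors
    survivors = comm.Shrink()  # new communicator without the failed processes

    missing = workers_to_respawn(target_size, survivors.Get_size())
    if missing == 0:
        return survivors

    # Collective over the survivors; the children are expected to merge
    # from their side with high=True.
    intercomm = survivors.Spawn(
        sys.executable, args=[worker_script], maxprocs=missing
    )
    return intercomm.Merge(high=False)


def run_step(comm, step, worker_script, target_size):
    """Run one application step and repair the communicator if a process died."""
    # Imported here so the module can be loaded on a machine without MPI.
    from mpi4py import MPI

    try:
        step(comm)
        return comm
    except MPI.Exception as exc:
        # Only process failures are handled here; anything else is re-raised.
        if exc.Get_error_class() not in (MPI.ERR_PROC_FAILED, MPI.ERR_REVOKED):
            raise
        return recover(comm, worker_script, target_size)
```

All of this runs inside the application, as the answer requires. Application state held by the lost process is not restored by these calls, so the program still needs its own way to redistribute or recompute that data.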
