Fault tolerance in MPICH/OpenMPI


Question

I have two questions:


Q1. Is there a more efficient way to handle errors in MPI than checkpoint/rollback? I see that if a node "dies", the program halts abruptly. Is there any way to continue execution after a node dies? (It's fine if this comes at the cost of accuracy.)


Q2. I read in "http://stackoverflow.com/questions/144309/what-is-the-best-mpi-implementation" that OpenMPI has better fault tolerance and that MPICH-2 has recently added similar features. Does anybody know what they are and how to use them? Is it a "mode"? Can they help in the situation described in Q1?

Please reply. Thanks.

Answer


MPI - all implementations - has long had the ability to continue after an error. The default is to die - that is, the default error handler is MPI_ERRORS_ARE_FATAL - but that can be changed (e.g., see the discussion here). But the standard doesn't currently go much beyond that; that is, it's hard to recover and continue after such an error. If your program is sufficiently simple - some sort of master-worker type of setup - it may be possible to continue this way.


The MPI Forum is currently working on what will become MPI-3, and error handling and fault tolerance will be an important component of the new standard (there's a working group dedicated to the topic). Until that work is complete, however, the only way to get stronger fault tolerance out of MPI is to use earlier, nonstandard extensions. FT-MPI was a project that developed a very robust MPI, but unfortunately it's based on MPI-1.2, a very early version of the standard. The claim here is that they're now working with OpenMPI, but I don't know what's become of that. There's MPICH-V, based on MPI-2, but that's more checkpoint-restart based than what I think you're looking for.


Updated to add: Fault tolerance didn't make it into MPI-3, but the working group continues its work, and the expectation is that something will come of it before too long.
