Performance of MPI_Reduce vs (MPI_Gather + Reduction on Root)

Question

The system is a Cray supercomputer using the MPICH2 library. Each node has 32 CPUs.

I have a single float on each of N different MPI ranks, where each rank is on a different node. I need to perform a reduction operation on this group of floats. I would like to know whether MPI_Reduce is faster than MPI_Gather followed by a reduction computed on the root, for any value of N. Please assume that the reduction on the root rank is done with a good parallel reduction algorithm that can use N threads.

If it isn't faster for every value of N, does it tend to be faster for smaller N, like 16, or for larger N?

If it is true, why? (For example, does MPI_Reduce use a tree communication pattern that tends to hide the time of the reduction operation in the way it communicates with the next level of the tree?)

Recommended answer

Assume that MPI_Reduce is always faster than MPI_Gather + local reduce.

Even if there were some N for which a reduction is slower than a gather, an MPI implementation could easily implement the reduction for that case in terms of gather + local reduce.

MPI_Reduce has only advantages over MPI_Gather + local reduce:

  1. MPI_Reduce is the higher-level operation, which gives the implementation more opportunity to optimize.
  2. MPI_Reduce needs to allocate much less memory.
  3. MPI_Reduce needs to communicate less data (if using a tree) or less data over the same link (if using direct all-to-one).
  4. MPI_Reduce can distribute the computation across more resources (e.g. using a tree communication pattern).
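
Points 2 to 4 can be illustrated with a small simulation that needs no MPI installation. This is only a sketch under stated assumptions: it models a direct all-to-one gather and a binomial-tree reduction, and it does not claim that MPICH2 uses either algorithm on a given machine. The function names `flat_gather_reduce` and `tree_reduce` are hypothetical, and rank 0 plays the root.

```python
def flat_gather_reduce(values):
    """Direct all-to-one gather, then a local reduction on the root (rank 0).

    values[i] is the single number held by rank i; at least one rank is assumed.
    """
    root_buffer = list(values)  # the root has to hold all N values at once
    return {
        "result": sum(root_buffer),
        "root_messages": len(values) - 1,  # every other rank sends to the root
        "root_slots": len(root_buffer),    # memory the root needs, in values
    }


def tree_reduce(values):
    """Binomial-tree reduction (an assumed algorithm, for illustration only).

    In each round the active ranks pair up; one rank of each pair sends its
    partial sum to the other, which adds it. Rank 0 ends up with the total.
    """
    partial = list(values)
    n = len(partial)
    rounds = 0
    root_messages = 0
    step = 1
    while step < n:
        for receiver in range(0, n, 2 * step):
            sender = receiver + step
            if sender < n:
                partial[receiver] += partial[sender]  # work spread over many ranks
                if receiver == 0:
                    root_messages += 1
        step *= 2
        rounds += 1
    return {
        "result": partial[0],
        "root_messages": root_messages,  # one message per round at the root
        "root_slots": min(n, 2),         # own partial sum plus one receive slot
        "rounds": rounds,
    }


for n in (2, 16, 1024):
    values = list(range(n))
    flat, tree = flat_gather_reduce(values), tree_reduce(values)
    print(n, flat["root_messages"], tree["root_messages"], tree["rounds"])
```

In this model the root's link carries N - 1 messages for the flat gather but only about log2(N) for the tree, and the additions are spread over the ranks instead of being concentrated on the root.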

That said: never assume anything about performance. Measure.
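
One possible way to measure is sketched below. It is an assumption-laden template, not the method used in the original setup: the question concerns the C interface of MPICH2, while this sketch uses Python with the mpi4py bindings, so interpreter overhead is included in both timings. The names `time_collective` and `run_with_mpi4py` are hypothetical. The timing helper itself has no MPI dependency; the MPI-specific part is isolated in `run_with_mpi4py`, which is defined but not called here.

```python
import statistics
import time
from array import array


def time_collective(op, barrier=None, clock=time.perf_counter, warmup=10, repeats=100):
    """Return the median duration of op() over `repeats` timed calls."""
    for _ in range(warmup):
        op()  # untimed calls so connection setup does not skew the samples
    samples = []
    for _ in range(repeats):
        if barrier is not None:
            barrier()  # line the ranks up before each timed call
        start = clock()
        op()
        samples.append(clock() - start)
    return statistics.median(samples)


def run_with_mpi4py(repeats=1000):
    """Compare both approaches; assumes mpi4py is installed and an MPI launcher is used."""
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    send = array("f", [float(rank)])                        # one float per rank
    out = array("f", [0.0]) if rank == 0 else None          # reduction result on root
    gathered = array("f", [0.0] * size) if rank == 0 else None

    def reduce_op():
        recv = [out, MPI.FLOAT] if rank == 0 else None
        comm.Reduce([send, MPI.FLOAT], recv, op=MPI.SUM, root=0)

    def gather_op():
        recv = [gathered, MPI.FLOAT] if rank == 0 else None
        comm.Gather([send, MPI.FLOAT], recv, root=0)
        if rank == 0:
            out[0] = sum(gathered)  # serial local reduction on the root

    t_reduce = time_collective(reduce_op, comm.Barrier, MPI.Wtime, repeats=repeats)
    t_gather = time_collective(gather_op, comm.Barrier, MPI.Wtime, repeats=repeats)
    if rank == 0:
        print(f"N={size} reduce={t_reduce:.3e}s gather+local={t_gather:.3e}s")
```

To use it, add a call to `run_with_mpi4py()` at the end of the script and launch it with one rank per node, for example with `mpiexec`, repeating the run for each N of interest. The local reduction here is a plain serial sum, not the N-thread parallel reduction assumed in the question.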
