why is mpi_bcast so much slower than mpi_reduce?


Question

Using MPI, we can do a broadcast to send an array to many nodes, or a reduce to combine arrays from many nodes onto one node.

I guess that the fastest way to implement these will be using a binary tree, where each node either sends to two nodes (bcast) or reduces over two nodes (reduce), which will give a time logarithmic in the number of nodes.

There doesn't seem to be any reason why broadcast would be particularly slower than reduce?

I ran the following test program on a 4-computer cluster, where each computer has 12 cores. The strange thing is that broadcast was quite a lot slower than reduce. Why? Is there anything I can do about it?

The results were:

inited mpi: 0.472943 seconds
N: 200000 1.52588MB
P = 48
did alloc: 0.000147641 seconds
bcast: 0.349956 seconds
reduce: 0.0478526 seconds
bcast: 0.369131 seconds
reduce: 0.0472673 seconds
bcast: 0.516606 seconds
reduce: 0.0448555 seconds

The code is:

#include <iostream>
#include <cstdlib>
#include <cstdio>
#include <ctime>
#include <sys/time.h>
#include <string>
using namespace std;

#include <mpi.h>

class NanoTimer {
public:
   struct timespec start;

   NanoTimer() {
      clock_gettime(CLOCK_MONOTONIC,  &start);

   }
   double elapsedSeconds() {
      struct timespec now;
      clock_gettime(CLOCK_MONOTONIC,  &now);
      double time = (now.tv_sec - start.tv_sec) + (double) (now.tv_nsec - start.tv_nsec) * 1e-9;
      start = now;
      return time;
   }
    void toc(string label) {
        double elapsed = elapsedSeconds();
        cout << label << ": " << elapsed << " seconds" << endl;        
    }
};

int main( int argc, char *argv[] ) {
    if( argc < 2 ) {
        cout << "Usage: " << argv[0] << " [N]" << endl;
        return -1;
    }
    int N = atoi( argv[1] );

    NanoTimer timer;

    MPI_Init( &argc, &argv );
    int p, P;
    MPI_Comm_rank( MPI_COMM_WORLD, &p );
    MPI_Comm_size( MPI_COMM_WORLD, &P );
    MPI_Barrier(MPI_COMM_WORLD);
    if( p == 0 ) timer.toc("inited mpi");
    if( p == 0 ) {
        cout << "N: " << N << " " << (N*sizeof(double)/1024.0/1024) << "MB" << endl;
        cout << "P = " << P << endl;
    }
    double *src = new double[N];
    double *dst = new double[N];
    MPI_Barrier(MPI_COMM_WORLD);
    if( p == 0 ) timer.toc("did alloc");

    for( int it = 0; it < 3; it++ ) {    
        MPI_Bcast( src, N, MPI_DOUBLE, 0, MPI_COMM_WORLD );    
        MPI_Barrier(MPI_COMM_WORLD);
        if( p == 0 ) timer.toc("bcast");

        MPI_Reduce( src, dst, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
        MPI_Barrier(MPI_COMM_WORLD);
        if( p == 0 ) timer.toc("reduce");
    }

    delete[] src;
    delete[] dst;

    MPI_Finalize();
    return 0;
}

The cluster nodes were running 64-bit ubuntu 12.04. I tried both openmpi and mpich2, and got very similar results. The network is gigabit ethernet, which is not the fastest, but what I'm most curious about is not the absolute speed, so much as the disparity between broadcast and reduce.

Answer

I don't think this quite answers your question, but I hope it provides some insight.

MPI is just a standard. It doesn't define how every function should be implemented. Therefore the performance of certain tasks in MPI (in your case MPI_Bcast and MPI_Reduce) are based strictly on the implementation you are using. It is possible that you could design a broadcast using point-to-point communication methods that performs better than the given MPI_Bcast.

Anyway, you have to consider what each of these functions is doing. Broadcast takes information from one process and sends it to all other processes; reduce takes information from each process and reduces it onto one process. According to the (most recent) standard, MPI_Bcast is considered a One-to-All collective operation and MPI_Reduce is considered an All-to-One collective operation. Therefore your intuition about using binary trees for MPI_Reduce is probably realized in both implementations. However, it is most likely not used for MPI_Bcast. It might be the case that MPI_Bcast is implemented using non-blocking point-to-point communication (sending from the process containing the information to all other processes) with a wait-all after the communication. In any case, in order to figure out how both functions work, I would suggest delving into the source code of the OpenMPI and MPICH2 implementations you are using.
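To make the binary-tree intuition concrete, here is a minimal sketch (plain C++, not taken from any real MPI implementation) of the communication schedule a binomial-tree broadcast from rank 0 would follow. Each `(src, dst)` pair stands in for one hypothetical MPI_Send/MPI_Recv exchange; the function name and structure are illustrative assumptions, not an actual library API.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sketch of a binomial-tree broadcast schedule among P ranks, rooted
// at rank 0. In round k, every rank that already holds the data sends
// it to rank + 2^k, so the set of holders doubles each round and the
// whole broadcast finishes in ceil(log2(P)) rounds.
std::vector<std::pair<int, int>> binomialBcastSchedule(int P) {
    std::vector<std::pair<int, int>> sends;  // (src, dst) pairs, in order
    std::vector<bool> has(P, false);
    has[0] = true;  // the root starts out holding the data
    for (int step = 1; step < P; step *= 2) {
        std::vector<std::pair<int, int>> round;
        for (int src = 0; src < P; ++src) {
            int dst = src + step;
            if (has[src] && dst < P && !has[dst])
                round.push_back({src, dst});
        }
        // All sends within a round can proceed in parallel, so mark
        // receivers only after the whole round has been planned.
        for (auto &p : round) {
            has[p.second] = true;
            sends.push_back(p);
        }
    }
    return sends;
}
```

For P = 48 (as in the timings above) this schedule needs 6 rounds and 47 point-to-point messages in total, whereas a flat broadcast would send 47 messages sequentially from rank 0 alone, all sharing the root's outgoing link. That difference in the root's load is the kind of thing that could produce a gap like the one measured, depending on which algorithm the implementation picks.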
