MPI_Allreduce mixes elements in the sum

Question

I am parallelising a Fortran code which works with no problem in a non-MPI version. Below is an excerpt of the code.

Every processor does the following:

  • For a certain number of particles it evolves certain quantities in the loop "do 203"; in a given interval which is divided into Nint subintervals (j=1,Nint), every processor produces an element of the vectors Nx1(j), Nx2(j).
  • Then, the vectors Nx1(j), Nx2(j) are sent to the root (mype = 0), which in every subinterval (j=1,Nint) sums the contributions from all processors: Nx1(j) from processor 1 + Nx1(j) from processor 2 + ... The root sums for every value of j (every subinterval) and produces Nx5(j), Nx6(j), as illustrated just after this list.
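
For concreteness, the intended result in every bin is Nx5(j) = Nx1(j) from the first processor + Nx1(j) from the second processor + ..., and likewise Nx6(j) from Nx2(j). With an invented two-process example and Nint = 3: Nx1 = (3, 0, 1) on rank 0 and Nx1 = (2, 4, 0) on rank 1 should give Nx5 = (5, 4, 1).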

Another issue is that if I deallocate the variables, the code remains in standby after the end of the calculation without completing the execution, but I don't know whether this is related to the MPI_Allreduce issue.


    include "mpif.h"
    ...
    integer*4 ....
    ...
    real*8 
    ...
    call MPI_INIT(mpierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, npe, mpierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, mype, mpierr)

!       Allocate variables
    allocate(Nx1(Nint),Nx5(Nint))
    ...

!       Parameters
    ...

    call MPI_Barrier (MPI_COMM_WORLD, mpierr)

!   Loop on particles

    do 100 npartj=1,npart_local

     call init_random_seed() 
     call random_number (rand)

    ...
!       Initial condition
    ... 
    do 203 i=1,1000000  ! loop for time evolution of single particle

        if(ufinp.gt.p1.and.ufinp.le.p2)then 
         do j=1,Nint  ! spatial position at any momentum
          ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
          rs(j) = ls(j)+Delta/Nint
          if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
            Nx1(j)=Nx1(j)+1 
          endif 
         enddo
        endif
       if(ufinp.gt.p2.and.ufinp.le.p3)then 
        do j=1,Nint  ! spatial position at any momentum
          ls(j) = lb+(j-1)*Delta/Nint !Left side of sub-interval across shock
          rs(j) = ls(j)+Delta/Nint
          if(y(1).gt.ls(j).and.y(1).lt.rs(j))then !position-ordered
            Nx2(j)=Nx2(j)+1 
          endif 
        enddo
       endif
203  continue 
100    continue     
    call MPI_Barrier (MPI_COMM_WORLD, mpierr)

    print*,"To be summed"
    do j=1,Nint
       call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum,
     &      MPI_COMM_WORLD, mpierr)
           call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum,
     &          MPI_COMM_WORLD, mpierr)
     enddo 

    if(mype.eq.0)then
     do j=1,Nint
       write(1,107)ls(j),Nx5(j),Nx6(j)
     enddo 
107  format(3(F13.2,2X,i6,2X,i6))   
    endif 
    call MPI_Barrier (MPI_COMM_WORLD, mpierr)
    print*,"Now deallocate"
!   deallocate(Nx1)  !inserting the de-allocate
!   deallocate(Nx2)

    close(1)

    call MPI_Finalize(mpierr)

    end


!  Subroutines
    ...

Answer

Then, the vectors Nx1(j), Nx2(j) are sent to the root (mype = 0), which in every subinterval (j=1,Nint) sums the contributions from all processors: Nx1(j) from processor 1 + Nx1(j) from processor 2 + ... The root sums for every value of j (every subinterval) and produces Nx5(j), Nx6(j).

This is not what an allreduce does. A reduction means the summation is done in parallel across all processes; an allreduce means that all processes will get the result of the summation.
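
To make the distinction concrete, here is a minimal sketch (not part of the original answer; the variable names mine, tot_root and tot_all are invented for illustration) that sums one integer per rank both ways:

   integer :: mine, tot_root, tot_all

   mine = mype + 1   ! each rank contributes a different value

   ! MPI_REDUCE: only the root (rank 0) receives the sum
   call MPI_REDUCE (mine, tot_root, 1, MPI_INTEGER, MPI_SUM, 0, &
     &              MPI_COMM_WORLD, mpierr)

   ! MPI_ALLREDUCE: every rank receives the same sum
   call MPI_ALLREDUCE (mine, tot_all, 1, MPI_INTEGER, MPI_SUM, &
     &                 MPI_COMM_WORLD, mpierr)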

Your MPI_Allreduce calls:

   call MPI_ALLREDUCE (Nx1(j),Nx5(j),npe,mpi_integer,mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)
   call MPI_ALLREDUCE (Nx2(j),Nx6(j),npe,mpi_integer,mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)

Actually, it looks like the count should be 1 here. This is because count just states how many elements you are to receive from each process, not how many there will be in total.
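
With the original per-element loop kept, the minimal fix would therefore be to pass a count of 1 (a sketch of that variant, before the better vector version below):

   do j = 1, Nint
      call MPI_ALLREDUCE (Nx1(j), Nx5(j), 1, mpi_integer, mpi_sum, &
     &                    MPI_COMM_WORLD, mpierr)
      call MPI_ALLREDUCE (Nx2(j), Nx6(j), 1, mpi_integer, mpi_sum, &
     &                    MPI_COMM_WORLD, mpierr)
   enddo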

However, you actually do not need that loop at all, because allreduce is luckily capable of handling multiple elements at once. Thus, instead of the loop with your allreduces, I believe you actually want something like:

   integer :: Nx1(nint)
   integer :: Nx2(nint)
   integer :: Nx5(nint)
   integer :: Nx6(nint)

   call MPI_ALLREDUCE (Nx1, Nx5, nint, mpi_integer, mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)
   call MPI_ALLREDUCE (Nx2, Nx6, nint, mpi_integer, mpi_sum, &
     &                 MPI_COMM_WORLD, mpierr)

Nx5 will contain the sum of Nx1 across all partitions, and Nx6 the sum of Nx2. The information in your question is a little bit lacking, so I am not quite sure whether this is what you are looking for.
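
As a side note that is not part of the original answer: since only mype == 0 writes the results, an MPI_REDUCE to root 0 would also be sufficient and avoids distributing the sums to every rank; a sketch under that assumption:

   call MPI_REDUCE (Nx1, Nx5, Nint, MPI_INTEGER, MPI_SUM, 0, &
     &              MPI_COMM_WORLD, mpierr)
   call MPI_REDUCE (Nx2, Nx6, Nint, MPI_INTEGER, MPI_SUM, 0, &
     &              MPI_COMM_WORLD, mpierr)

With MPI_REDUCE the result arrays Nx5, Nx6 are only defined on the root, which is enough here because only the root prints them.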
