OPENMP F90/95 Nested DO loops - problems getting improvement over serial implementation


Problem Description

I've done some searching but couldn't find anything that appeared to be related to my question (sorry if my question is redundant!). Anyway, as the title states, I'm having trouble getting any improvement over the serial implementation of my code. The code snippet that I need to parallelize is as follows (this is Fortran90 with OpenMP):

do n=1,lm
  do m=1,jm
    do l=1,im
      sum_u = 0
      sum_v = 0
      sum_t = 0
      do k=1,lm
        !$omp parallel do reduction(+:sum_u,sum_v,sum_t)
        do j=1,jm
          do i=1,im
            exp_smoother = exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
            sum_u = sum_u + u_p(i,j,k) * exp_smoother
            sum_v = sum_v + v_p(i,j,k) * exp_smoother
            sum_t = sum_t + t_p(i,j,k) * exp_smoother

            sum_u_pert(l,m,n) = sum_u
            sum_v_pert(l,m,n) = sum_v
            sum_t_pert(l,m,n) = sum_t
          end do
        end do
      end do
    end do
  end do
end do

Am I running into race condition issues? Or am I simply putting the directive in the wrong place? I'm pretty new to this, so I apologize if this is an overly simplistic problem.

Anyway, without parallelization, the code is excruciatingly slow. To give an idea of the size of the problem, the lm, jm, and im indexes are 60, 401, and 501 respectively. So the parallelization is critical. Any help or links to helpful resources would be very much appreciated! I'm using xlf to compile the above code, if that's at all useful.

Thanks! -Jen

Solution

The obvious place to put the OpenMP directive is on the outermost loop.

For every (l,m,n), you're calculating a convolution between your perturbed variables and an exponential smoother. Each (l,m,n) calculation is completely independent of the others, so you can put the parallelism on the outermost loop. For instance, the simplest approach is:
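
! (Assumes sum_u_pert, sum_v_pert and sum_t_pert have been zeroed before this
!  loop nest, and that hzscl, vscl, im, jm and lm are named constants or are
!  otherwise given data-sharing attributes consistent with default(none).)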

!$omp parallel do private(n,m,l,i,j,k,exp_smoother) shared(sum_u_pert,sum_v_pert,sum_t_pert,u_p,v_p,t_p), default(none)
do n=1,lm
  do m=1,jm
    do l=1,im
      do k=1,lm
        do j=1,jm
          do i=1,im
            exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
            sum_u_pert(l,m,n) = sum_u_pert(l,m,n) + u_p(i,j,k) * exp_smoother
            sum_v_pert(l,m,n) = sum_v_pert(l,m,n) + v_p(i,j,k) * exp_smoother
            sum_t_pert(l,m,n) = sum_t_pert(l,m,n) + t_p(i,j,k) * exp_smoother
          end do
        end do
      end do
    end do
  end do
end do

gives me a ~6x speedup on 8 cores (using a much reduced problem size of 20x41x41). Given the amount of work there is to do in the loops, even at the smaller size, I assume the reason it's not an 8x speedup involves memory contention or false sharing; for further performance tuning you might want to explicitly break the sum arrays into sub-blocks for each thread and combine them at the end; but depending on the problem size, having the equivalent of an extra im x jm x lm sized array per thread might not be desirable.
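
A rough sketch of what that per-thread splitting could look like (this is an illustration, not code from the answer above; the local_u/local_v/local_t arrays and the critical section are hypothetical choices, and hzscl, vscl, im, jm and lm are assumed to be ordinary variables that may legally appear in the data-sharing clauses):

real, allocatable :: local_u(:,:,:), local_v(:,:,:), local_t(:,:,:)

sum_u_pert = 0
sum_v_pert = 0
sum_t_pert = 0

!$omp parallel default(none) &
!$omp   private(n,m,l,k,j,i,exp_smoother,local_u,local_v,local_t) &
!$omp   shared(sum_u_pert,sum_v_pert,sum_t_pert,u_p,v_p,t_p,hzscl,vscl,im,jm,lm)

  ! Each thread allocates and zeroes its own private accumulation arrays.
  allocate(local_u(im,jm,lm), local_v(im,jm,lm), local_t(im,jm,lm))
  local_u = 0
  local_v = 0
  local_t = 0

  !$omp do
  do n=1,lm
    do m=1,jm
      do l=1,im
        do k=1,lm
          do j=1,jm
            do i=1,im
              exp_smoother = exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
              local_u(l,m,n) = local_u(l,m,n) + u_p(i,j,k) * exp_smoother
              local_v(l,m,n) = local_v(l,m,n) + v_p(i,j,k) * exp_smoother
              local_t(l,m,n) = local_t(l,m,n) + t_p(i,j,k) * exp_smoother
            end do
          end do
        end do
      end do
    end do
  end do
  !$omp end do

  ! Combine the per-thread partial sums into the shared arrays, one thread at a time.
  !$omp critical
  sum_u_pert = sum_u_pert + local_u
  sum_v_pert = sum_v_pert + local_v
  sum_t_pert = sum_t_pert + local_t
  !$omp end critical

  deallocate(local_u, local_v, local_t)
!$omp end parallel

If your OpenMP implementation accepts whole arrays in a reduction clause, reduction(+:sum_u_pert,sum_v_pert,sum_t_pert) on the parallel do would let the runtime do this bookkeeping for you.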

It seems like there's a lot of structure in this problem you ought to be able to exploit to speed up even the serial case, but it's easier to say that than to find it; playing around with pen and paper for a few minutes, nothing comes to mind, but someone cleverer may spot something.
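
One candidate for that kind of structure, offered only as a speculative sketch and not as part of the answer above: the smoother factorises as exp(-abs(i-l)/hzscl) * exp(-abs(j-m)/hzscl) * exp(-abs(k-n)/vscl), so the triple sum is a separable convolution that can be evaluated as three 1-D contractions. The sketch below handles u_p only (v_p and t_p would be analogous); wx, wy, wz, tmp1 and tmp2 are hypothetical work arrays.

real, allocatable :: wx(:,:), wy(:,:), wz(:,:), tmp1(:,:,:), tmp2(:,:,:)

allocate(wx(im,im), wy(jm,jm), wz(lm,lm))
allocate(tmp1(im,jm,lm), tmp2(im,jm,lm))

! Precompute the three 1-D weight tables.
do l=1,im
  do i=1,im
    wx(i,l) = exp(-(abs(i-l)/hzscl))
  end do
end do
do m=1,jm
  do j=1,jm
    wy(j,m) = exp(-(abs(j-m)/hzscl))
  end do
end do
do n=1,lm
  do k=1,lm
    wz(k,n) = exp(-(abs(k-n)/vscl))
  end do
end do

! Pass 1: contract over i.  tmp1(l,j,k) = sum_i u_p(i,j,k) * wx(i,l)
do k=1,lm
  do j=1,jm
    do l=1,im
      tmp1(l,j,k) = sum(u_p(:,j,k) * wx(:,l))
    end do
  end do
end do

! Pass 2: contract over j.  tmp2(l,m,k) = sum_j tmp1(l,j,k) * wy(j,m)
do k=1,lm
  do m=1,jm
    do l=1,im
      tmp2(l,m,k) = sum(tmp1(l,:,k) * wy(:,m))
    end do
  end do
end do

! Pass 3: contract over k.  sum_u_pert(l,m,n) = sum_k tmp2(l,m,k) * wz(k,n)
do n=1,lm
  do m=1,jm
    do l=1,im
      sum_u_pert(l,m,n) = sum(tmp2(l,m,:) * wz(:,n))
    end do
  end do
end do

deallocate(wx, wy, wz, tmp1, tmp2)

If this factorisation applies, the operation count drops from roughly (im*jm*lm)**2 to about im*jm*lm*(im+jm+lm), and each pass can still be parallelised with its own parallel do.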
