not sure what should be SHARED or PRIVATE in openmp loop


Problem Description


I have a loop which updates a matrix A and I want to make it openmp but I'm not sure what variables should be shared and private. I would have thought just ii and jj would work but it doesn't. I think I need an !$OMP ATOMIC UPDATE somewhere too...

The loop just calculates the distance between N and N-1 particles and updates a matrix A.

            !$OMP PARALLEL DO PRIVATE(ii,jj)
            do ii=1,N-1
                    do jj=ii+1,N
                            distance_vector=X(ii,:)-X(jj,:)
                            distance2=sum(distance_vector*distance_vector)
                            distance=DSQRT(distance2)
                            coff=distance*distance*distance
                            PE=PE-M(II)*M(JJ)/distance
                            A(jj,:)=A(jj,:)+(M(ii)/coff)*(distance_vector)
                            A(ii,:)=A(ii,:)-(M(jj)/coff)*(distance_vector)
                    end do
            end do
            !$OMP END PARALLEL DO

Solution

The golden rule of OpenMP is that all variables (with some exclusions) that are defined in an outer scope are shared by default in the parallel region. Since Fortran before 2008 has no local scopes (i.e. there is no BLOCK ... END BLOCK in earlier versions), all variables (except threadprivate ones) are shared, which feels very natural to me (unlike Ian Bush, I am not a big fan of using default(none) and then redeclaring the visibility of all 100+ local variables in various complex scientific codes).

Here is how to determine the sharing class of each variable:

  • N - shared, because it should be the same in all threads and they only read its value.
  • ii - it is the counter of loop, subject to a worksharing directive, so its sharing class is predetermined to be private. It doesn't hurt to explicitly declare it in a PRIVATE clause, but that is not really necessary.
  • jj - loop counter of a loop, which is not subject to a worksharing directive, hence jj should be private.
  • X - shared, because all threads reference and only read from it.
  • distance_vector - obviously should be private as each thread works on different pairs of particles.
  • distance, distance2, and coff - ditto.
  • M - should be shared for the same reasons as X.
  • PE - acts as an accumulator variable (I guess this is the potential energy of the system) and should be the subject of a reduction operation, i.e. it should be put in a REDUCTION(+:....) clause.
  • A - this one is tricky. It could be either shared and updates to A(jj,:) protected with synchronising constructs, or you could use reduction (OpenMP allows reductions over array variables in Fortran unlike in C/C++). A(ii,:) is never modified by more than one thread so it does not need special treatment.

With reduction over A in place, each thread would get its own private copy of A, which could be a memory hog, although I doubt you would use this direct O(N²) simulation code to compute systems with a very large number of particles. There is also a certain overhead associated with the reduction implementation. In this case you simply need to add A to the list of the REDUCTION(+:...) clause.
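For reference, here is a minimal sketch of what that reduction variant could look like (a sketch only; it assumes A is an ordinary, already-allocated array of the same shape as in your code):

!$OMP PARALLEL DO PRIVATE(jj,distance_vector,distance2,distance,coff) &
!$OMP& REDUCTION(+:PE,A)
do ii=1,N-1
   do jj=ii+1,N
      distance_vector=X(ii,:)-X(jj,:)
      distance2=sum(distance_vector*distance_vector)
      distance=DSQRT(distance2)
      coff=distance*distance*distance
      PE=PE-M(ii)*M(jj)/distance
      ! each thread accumulates into its own private copy of A;
      ! the private copies are summed into the shared A at the end of the region
      A(jj,:)=A(jj,:)+(M(ii)/coff)*(distance_vector)
      A(ii,:)=A(ii,:)-(M(jj)/coff)*(distance_vector)
   end do
end do
!$OMP END PARALLEL DO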

With synchronising constructs you have two options. You could either use the ATOMIC construct or the CRITICAL construct. As ATOMIC is only applicable in scalar contexts, you would have to "unvectorise" the array assignment and apply ATOMIC to each statement separately, e.g.:

!$OMP ATOMIC UPDATE
A(jj,1)=A(jj,1)+(M(ii)/coff)*(distance_vector(1))
!$OMP ATOMIC UPDATE
A(jj,2)=A(jj,2)+(M(ii)/coff)*(distance_vector(2))
!$OMP ATOMIC UPDATE
A(jj,3)=A(jj,3)+(M(ii)/coff)*(distance_vector(3))

You may also rewrite this as a loop - do not forget to declare the loop counter private.

With CRITICAL there is no need to unvectorise the loop:

!$OMP CRITICAL (forceloop)
A(jj,:)=A(jj,:)+(M(ii)/coff)*(distance_vector)
!$OMP END CRITICAL (forceloop)

Naming critical regions is optional and a bit unnecessary in this particular case, but in general it lets you keep unrelated critical regions separate.
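For illustration only (the second region and the nupdates counter are hypothetical and not part of your code), two regions with different names use different locks and so do not block each other, while all unnamed CRITICAL constructs share a single lock:

!$OMP CRITICAL (forceloop)
! only threads updating the force array contend for this lock
A(jj,:)=A(jj,:)+(M(ii)/coff)*(distance_vector)
!$OMP END CRITICAL (forceloop)

!$OMP CRITICAL (bookkeeping)
! unrelated region with its own lock; does not serialise against forceloop
nupdates=nupdates+1
!$OMP END CRITICAL (bookkeeping)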

Which is faster? Unrolled with ATOMIC or CRITICAL? It depends on many things. Usually CRITICAL is way slower since it often involves function calls to the OpenMP runtime while atomic increments, at least on x86, are implemented with locked addition instructions. As they often say, YMMV.

To recapitulate, a working version of your loop should be something like:

!$OMP PARALLEL DO PRIVATE(jj,kk,distance_vector,distance2,distance,coff) &
!$OMP& REDUCTION(+:PE)
do ii=1,N-1
   do jj=ii+1,N
      distance_vector=X(ii,:)-X(jj,:)
      distance2=sum(distance_vector*distance_vector)
      distance=DSQRT(distance2)
      coff=distance*distance*distance
      PE=PE-M(ii)*M(jj)/distance
      do kk=1,3
         !$OMP ATOMIC UPDATE
         A(jj,kk)=A(jj,kk)+(M(ii)/coff)*(distance_vector(kk))
      end do
      A(ii,:)=A(ii,:)-(M(jj)/coff)*(distance_vector)
   end do
end do
!$OMP END PARALLEL DO

I've assumed that your system is 3-dimensional.


With all this said, I second Ian Bush that you need to rethink how the position and acceleration matrices are laid out in memory. Proper cache usage could boost your code and would also allow certain operations, e.g. X(:,ii)-X(:,jj), to be vectorised, i.e. implemented using vector SIMD instructions.
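As a sketch of that suggestion (the 3 x N shapes below are an assumption, not your actual declarations), storing one particle per column makes each coordinate triple contiguous in Fortran's column-major order, so the inner kernel works on contiguous slices:

! assumed transposed layout: one particle per column
double precision, dimension(3,N) :: X, A
double precision, dimension(3)   :: distance_vector
double precision :: distance2, distance, coff
integer :: ii, jj

do ii=1,N-1
   do jj=ii+1,N
      distance_vector=X(:,ii)-X(:,jj)    ! contiguous slices, SIMD-friendly
      distance2=sum(distance_vector*distance_vector)
      distance=DSQRT(distance2)
      coff=distance*distance*distance
      A(:,jj)=A(:,jj)+(M(ii)/coff)*(distance_vector)
      A(:,ii)=A(:,ii)-(M(jj)/coff)*(distance_vector)
   end do
end do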
