Using OpenMP critical and ordered


Problem



I'm quite new to Fortran and OpenMP, but I'm trying to get my bearings. I have a piece of code for calculating variograms which I'm attempting to parallelize. However, I seem to be getting race conditions, as some of the results are off by a thousandth or so.

The problem seems to be the reductions. Using OpenMP reductions works and gives the correct results, but it is not desirable, because the reductions actually happen in another subroutine (I copied the relevant lines into the OpenMP loop for the test). Therefore I put the reductions inside a CRITICAL section, but without success. Interestingly, the problem only occurs for reals, not integers. I have thought about whether or not the order of the additions makes any difference, but it should not produce errors this big.

Just to check, I put everything in the parallel do in an ORDERED block, which (of course) gave the correct results (albeit without any speedup). I also tried putting everything inside a CRITICAL section, but for some reason that did not give the correct results. My understanding is that OpenMP will flush the shared variables upon entering/exiting CRITICAL sections, so there shouldn't be any cache problems.

So my question is: why doesn't a critical section work in this case?

My code is below. All shared variables except np, tm, hm, gam are read-only.

EDIT: I tried to simulate the randomness induced by multiple threads by replacing the do loops with random integers in the same range (i.e. generate a pair i,j in the range of the loops; if they have already been "visited", generate new ones) and to my surprise the results matched. However, upon further inspection it was revealed that I had forgotten to seed the RNG, and the results were correct by coincidence. How embarrassing!

TL;DR: The discrepancies in the results were caused by the ordering of the floating point values. Using double precision instead helps.

!$OMP PARALLEL DEFAULT(none) SHARED(nd, x, y, z, nzlag, nylag, nxlag, &
!$OMP& dzlag, dylag, dxlag, nvarg, ivhead, ivtail, ivtype, vr, tmin, tmax, np, tm, hm, gam) num_threads(512)
!$OMP DO PRIVATE(i,j,zdis,ydis,xdis,izl,iyl,ixl,indx,vrh,vrt,vrhpr,vrtpr,variogram_type) !reduction(+:np, tm, hm, gam)
  DO i=1,nd        
!$OMP CRITICAL (main)
! Second loop over the data:
    DO j=1,nd

! The lag:
      zdis = z(j) - z(i)
      IF(zdis >= 0.0) THEN
        izl =  INT( zdis/dzlag+0.5)
      ELSE
        izl = -INT(-zdis/dzlag+0.5)
      END IF
 ! ---- SNIP ----

! Loop over all variograms for this lag:

      DO cur_variogram=1,nvarg
        variogram_type = ivtype(cur_variogram)

! Get the head and tail values:

        indx = i+(ivhead(cur_variogram)-1)*maxdim
        vrh   = vr(indx)
        indx = j+(ivtail(cur_variogram)-1)*maxdim
        vrt   = vr(indx)
        IF (vrh < tmin .OR. vrh >= tmax .OR. vrt < tmin .OR. vrt >= tmax) CYCLE

        ! ----- PROBLEM AREA -------
        np(ixl,iyl,izl,1)  = np(ixl,iyl,izl,1) + 1.   ! <-- This never fails
        tm(ixl,iyl,izl,1)  = tm(ixl,iyl,izl,1) + vrt  
        hm(ixl,iyl,izl,1)  = hm(ixl,iyl,izl,1) + vrh
        gam(ixl,iyl,izl,1) = gam(ixl,iyl,izl,1) + ((vrh-vrt)*(vrh-vrt))
        ! ----- END OF PROBLEM AREA -----

        !CALL updtvarg(ixl,iyl,izl,cur_variogram,variogram_type,vrt,vrh,vrtpr,vrhpr)
      END DO
    END DO
    !$OMP END CRITICAL (main)
  END DO
!$OMP END DO
!$OMP END PARALLEL

Thanks very much in advance!

Solution

If you are using 32-bit floating-point numbers and arithmetic, the difference between 84.26539 and 84.26538, that is a difference of 1 in the least significant digit, is entirely explicable by the non-determinism of parallel floating-point arithmetic. Bear in mind that a 32-bit f-p number only has about 7 decimal digits to play with.

Ordinary floating-point arithmetic is not strictly associative. For real (in the mathematical, not Fortran, sense) numbers (a+b)+c == a+(b+c), but there is no such rule for floating-point numbers. This is nicely explained in the Wikipedia article on floating-point arithmetic.

The non-determinism arises because, in using OpenMP, you surrender control over the ordering of operations to the run-time. A summation of values across threads (such as a reduction on +) leaves the bracketing of the global sum expression to the run-time. It is not even necessarily true that 2 executions of the same OpenMP program will produce results that are identical to the last bit.

I suspect that even running an OpenMP program on one thread may produce different results from the equivalent non-OpenMP program. Since knowledge of the number of threads available to an OpenMP executable may be deferred until run-time, the compiler will have to create a parallelised executable whether it is eventually run in parallel or not.
