Using OpenMP critical and ordered

Problem description

I'm quite new to Fortran and OpenMP, but I'm trying to get my bearings. I have a piece of code for calculating variograms which I'm attempting to parallelize. However, I seem to be getting race conditions, as some of the results are off by a thousandth or so.

The problem seems to be the reductions. OpenMP reductions work and give the correct results, but they are not desirable, because the reductions actually happen in another subroutine (I copied the relevant lines into the OpenMP loop for the test). Therefore I put the reductions inside a CRITICAL section, but without success. Interestingly, the problem only occurs for reals, not integers. I have thought about whether or not the order of the additions makes any difference, but it should not produce errors this big.
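
For reference, the reduction variant corresponds to the commented-out reduction(+:np, tm, hm, gam) clause in the code below. Here is a minimal sketch of that pattern with hypothetical names (Fortran has long allowed arrays in REDUCTION clauses; allocatable arrays need OpenMP 3.0 or later):

PROGRAM reduction_sketch
  IMPLICIT NONE
  INTEGER :: i
  REAL :: acc(8)   ! stand-in for accumulators like np/tm/hm/gam
  acc = 0.0
!$OMP PARALLEL DO REDUCTION(+:acc)
  DO i = 1, 100000
    ! each thread updates a private copy of acc; OpenMP sums the
    ! private copies into the shared acc when the loop ends
    acc(MOD(i, 8) + 1) = acc(MOD(i, 8) + 1) + 1.0
  END DO
!$OMP END PARALLEL DO
  PRINT *, acc     ! every bin should hold 12500.0
END PROGRAM reduction_sketch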

Just to check, I put everything in the parallel do in an ORDERED block, which (of course) gave the correct results (albeit without any speedup). I also tried putting everything inside a CRITICAL section, but for some reason that did not give the correct results. My understanding is that OpenMP will flush the shared variables upon entering/exiting CRITICAL sections, so there shouldn't be any cache problems.

So my question is: why doesn't a critical section work in this case?

My code is below. All shared variables except np, tm, hm, gam are read-only.

I tried to simulate the randomness induced by multiple threads by replacing the do loops with random integers in the same range (i.e. generate a pair i,j in the range of the loops; if they are "visited", generate new ones) and to my surprise the results matched. However, upon further inspection it was revealed that I had forgotten to seed the RNG, and the results were correct by coincidence. How embarrassing!

TL;DR: The discrepancies in the results were caused by the ordering of the floating-point additions. Using double precision instead helps.
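
A small self-contained demonstration of that point (hypothetical data, nothing from the variogram code itself): summing the same values in two different orders visibly changes a single-precision total, while a double-precision total stays stable to far more digits.

PROGRAM order_demo
  IMPLICIT NONE
  INTEGER :: i
  REAL, ALLOCATABLE :: x(:)
  REAL :: s_fwd, s_bwd
  DOUBLE PRECISION :: d_fwd, d_bwd

  ALLOCATE(x(100000))
  CALL RANDOM_NUMBER(x)

  s_fwd = 0.0
  s_bwd = 0.0
  d_fwd = 0.0D0
  d_bwd = 0.0D0
  DO i = 1, SIZE(x)          ! sum in ascending order
    s_fwd = s_fwd + x(i)
    d_fwd = d_fwd + DBLE(x(i))
  END DO
  DO i = SIZE(x), 1, -1      ! sum in descending order
    s_bwd = s_bwd + x(i)
    d_bwd = d_bwd + DBLE(x(i))
  END DO

  PRINT *, 'single: fwd - bwd =', s_fwd - s_bwd   ! typically nonzero
  PRINT *, 'double: fwd - bwd =', d_fwd - d_bwd   ! typically zero or tiny
END PROGRAM order_demo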

!$OMP PARALLEL DEFAULT(none) SHARED(nd, x, y, z, nzlag, nylag, nxlag, &
!$OMP& dzlag, dylag, dxlag, nvarg, ivhead, ivtail, ivtype, vr, tmin, tmax, np, tm, hm, gam) num_threads(512)
!$OMP DO PRIVATE(i,j,zdis,ydis,xdis,izl,iyl,ixl,indx,vrh,vrt,vrhpr,vrtpr,variogram_type) !reduction(+:np, tm, hm, gam)
  DO i=1,nd        
!$OMP CRITICAL (main)
! Second loop over the data:
    DO j=1,nd

! The lag:
      zdis = z(j) - z(i)
      IF(zdis >= 0.0) THEN
        izl =  INT( zdis/dzlag+0.5)
      ELSE
        izl = -INT(-zdis/dzlag+0.5)
      END IF
 ! ---- SNIP ----

! Loop over all variograms for this lag:

      DO cur_variogram=1,nvarg
        variogram_type = ivtype(cur_variogram)

! Get the head and tail values:

        indx = i+(ivhead(cur_variogram)-1)*maxdim
        vrh   = vr(indx)
        indx = j+(ivtail(cur_variogram)-1)*maxdim
        vrt   = vr(indx)
        IF(vrh < tmin.OR.vrh >= tmax.OR. vrt < tmin.OR.vrt >= tmax) CYCLE

        ! ----- PROBLEM AREA -------
        np(ixl,iyl,izl,1)  = np(ixl,iyl,izl,1) + 1.   ! <-- This never fails
        tm(ixl,iyl,izl,1)  = tm(ixl,iyl,izl,1) + vrt  
        hm(ixl,iyl,izl,1)  = hm(ixl,iyl,izl,1) + vrh
        gam(ixl,iyl,izl,1) = gam(ixl,iyl,izl,1) + ((vrh-vrt)*(vrh-vrt))
        ! ----- END OF PROBLEM AREA -----

        !CALL updtvarg(ixl,iyl,izl,cur_variogram,variogram_type,vrt,vrh,vrtpr,vrhpr)
      END DO
    END DO
    !$OMP END CRITICAL (main)
  END DO
!$OMP END DO
!$OMP END PARALLEL

Many thanks in advance!

Answer

If you are using 32-bit floating-point numbers and arithmetic, the difference between 84.26539 and 84.26538, that is, a difference of 1 in the least-significant digit, is entirely explicable by the non-determinism of parallel floating-point arithmetic. Bear in mind that a 32-bit f-p number only has about 7 decimal digits to play with.
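
You can query this limit directly with the standard Fortran intrinsics (a quick check, independent of the question's code):

PROGRAM digits_check
  IMPLICIT NONE
  ! PRECISION gives the number of reliable decimal digits,
  ! EPSILON the spacing of representable numbers near 1.0
  PRINT *, 'single:', PRECISION(1.0),   EPSILON(1.0)     ! 6,  ~1.19E-07
  PRINT *, 'double:', PRECISION(1.0D0), EPSILON(1.0D0)   ! 15, ~2.22E-16
END PROGRAM digits_check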

Ordinary floating-point arithmetic is not strictly associative. For real (in the mathematical, not Fortran, sense) numbers, (a+b)+c == a+(b+c), but there is no such rule for floating-point numbers. This is nicely explained in the Wikipedia article on floating-point arithmetic.
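
A concrete single-precision example, with values chosen to make the effect obvious:

PROGRAM assoc_demo
  IMPLICIT NONE
  REAL :: a, b, c
  a =  1.0E8
  b = -1.0E8
  c =  1.0
  ! -1.0E8 + 1.0 rounds back to -1.0E8 in single precision,
  ! so the two bracketings of the same sum disagree
  PRINT *, '(a+b)+c =', (a + b) + c   ! prints 1.0
  PRINT *, 'a+(b+c) =', a + (b + c)   ! prints 0.0
END PROGRAM assoc_demo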

The non-determinism arises because, in using OpenMP, you surrender control over the ordering of operations to the run-time. A summation of values across threads (such as a reduction on +) leaves the bracketing of the global sum expression to the run-time. It is not even necessarily true that 2 executions of the same OpenMP program will produce the same-to-the-last-bit results.
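
One way to observe this (a sketch, assuming several threads are available): run the program below a few times with OMP_NUM_THREADS set to, say, 4, and compare the last digits of the printed total.

PROGRAM rerun_demo
  IMPLICIT NONE
  INTEGER :: i, n
  INTEGER, ALLOCATABLE :: seed(:)
  REAL, ALLOCATABLE :: x(:)
  REAL :: total

  ! fix the seed so every run sums exactly the same data
  CALL RANDOM_SEED(SIZE=n)
  ALLOCATE(seed(n))
  seed = 12345
  CALL RANDOM_SEED(PUT=seed)

  ALLOCATE(x(1000000))
  CALL RANDOM_NUMBER(x)

  total = 0.0
!$OMP PARALLEL DO REDUCTION(+:total)
  DO i = 1, SIZE(x)
    total = total + x(i)
  END DO
!$OMP END PARALLEL DO

  ! identical data, but the per-thread partial sums are combined
  ! in an unspecified order, so the last digits may differ between runs
  PRINT '(A,F20.8)', 'total = ', total
END PROGRAM rerun_demo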

I suspect that even running an OpenMP program on one thread may produce different results from the equivalent non-OpenMP program. Since knowledge of the number of threads available to an OpenMP executable may be deferred until run-time, the compiler will have to create a parallelised executable whether it is eventually run in parallel or not.
