Why does my parallel code using OpenMP atomic take longer than serial code?


Question


The snippet of my serial code is shown below.

Program main
  use omp_lib
  Implicit None
   
  Integer :: i, my_id
  Real(8) :: t0, t1, t2, t3, a = 0.0d0

  !$ t0 = omp_get_wtime()
  Call CPU_time(t2)
  ! ------------------------------------------ !

    Do i = 1, 100000000
      a = a + Real(i)
    End Do

  ! ------------------------------------------ !
  Call CPU_time(t3)
  !$ t1 = omp_get_wtime()
  ! ------------------------------------------ !

  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
  Write (*,*) "The CPU time is ", t3-t2, "s"
End Program main

Elapsed time: [timing screenshot omitted]


By using the omp do and atomic directives, I converted the serial code into parallel code. However, the parallel program is slower than the serial program, and I don't understand why. Here is my parallel code snippet:

Program main
  use omp_lib
  Implicit None
    
  Integer, Parameter :: n_threads = 8
  Integer :: i, my_id
  Real(8) :: t0, t1, t2, t3, a = 0.0d0
 
  !$ t0 = omp_get_wtime()
  Call CPU_time(t2)
  ! ------------------------------------------ !

  !$OMP Parallel Num_threads(n_threads) shared(a)
  
   !$OMP Do 
     Do i = 1, 100000000
       !$OMP Atomic
       a = a + Real(i)
     End Do
   !$OMP End Do
  
  !$OMP End Parallel
  
  ! ------------------------------------------ !
  Call CPU_time(t3)
  !$ t1 = omp_get_wtime()
  ! ------------------------------------------ !

  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
  Write (*,*) "The CPU time is ", t3-t2, "s"
End Program main

Elapsed time: [timing screenshot omitted]


So my question is: why does my parallel code using OpenMP atomic take longer than the serial code?

Answer


You are applying an atomic operation to the same variable in every single loop iteration. Moreover, that variable carries a dependency across loop iterations. Naturally, this adds overhead compared with the sequential version (e.g., synchronization and serialization costs, and extra CPU cycles). Furthermore, you are probably incurring a lot of cache misses, because each atomic update to `a` invalidates the cache-line copies held by the other threads.


This is exactly the kind of code that should use a reduction on the variable `a` (i.e., `!$omp parallel do reduction(+:a)`) instead of an atomic operation. With a reduction, each thread gets a private copy of `a`, and at the end of the parallel region the threads combine their copies of `a` (using the `+` operator) into a single value, which is then stored in the main thread's `a`.
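A minimal sketch of the reduction version of the questioner's code (the timing calls and the unused `my_id` are dropped for brevity; `n_threads` is kept from the original). Note that the original expression `Real(i)` converts `i` to default (single) precision, and with a reduction the summation order differs between runs, so the printed value may differ slightly from the serial result in the low digits:

```fortran
Program main
  use omp_lib
  Implicit None

  Integer, Parameter :: n_threads = 8
  Integer :: i
  Real(8) :: a = 0.0d0

  ! Each thread accumulates into its own private copy of a; the
  ! private copies are summed once, at the end of the parallel region,
  ! so there is no per-iteration synchronization.
  !$OMP Parallel Do Num_threads(n_threads) Reduction(+:a)
  Do i = 1, 100000000
    a = a + Real(i)
  End Do
  !$OMP End Parallel Do

  Write (*,*) "a = ", a
End Program main
```

Because the atomic version serializes every single addition while the reduction version synchronizes only once per thread, the reduction should scale roughly with the number of cores rather than running slower than the serial loop.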


You can find a more detailed answer about the differences between atomic and reduction in this SO thread. That thread even includes code whose atomic version (just like yours) is dramatically slower than its sequential counterpart (i.e., 20x slower). In that case the slowdown is even worse than yours (i.e., 20x vs. 10x).
