openMP not improving runtime
Problem description
I inherited a piece of Fortran code and am tasked with parallelizing it for the 8-core machine we have. I have two versions of the code, and I am trying to use openMP compiler directives to speed it up. It works on one piece of code, but not the other, and I cannot figure out why: they're almost identical! I ran each piece of code with and without the openMP tags, and the first one showed speed improvements, but not the second one. I hope I am explaining this clearly...
Code sample 1: (significant improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
DO IN1=1,NN(1)
SCATT(IN1,IN2) = DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2))
UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
ENDDO
ENDDO
!$OMP END PARALLEL DO
Code sample 2: (no improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
DO IN1=1,NN(1)
SCATREL = DATA(2*((IN2-1)*NN(1)+IN1)-1)/(NN(1)*NN(2))
SCATIMG = DATA(2*((IN2-1)*NN(1)+IN1))/(NN(1)*NN(2))
SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
ENDDO
ENDDO
!$OMP END PARALLEL DO
I thought it might be issues with memory overhead and such, and have tried various combinations of putting variables in shared() and private() clauses, but they either cause segmentation faults or make it even slower.
I also thought it might be that I'm not doing enough work in the loop to see an improvement, but since there is an improvement in the smaller loop, that doesn't make sense to me.
Can anyone shed some light on what I can do to see a real speed boost in the second one?
Data on speed boost for code sample 1:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 2m 21.321s
With openMP tags: 2m 20.640s
Average runtime (profile for just this snippet)
Without openMP tags: 6.3s
With openMP tags: 4.75s
Data on speed boost for code sample 2:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 4m 46.659s
With openMP tags: 4m 49.200s
Average runtime (profile for just this snippet)
Without openMP tags: 15.14s
With openMP tags: 46.63s
Solution
The observation that the code runs slower in parallel than in serial tells me that the culprit is very likely false sharing.
The SCATT array is shared, and each thread accesses a slice of it for both reading and writing. There is no race condition in your code; however, the threads writing to the same array (albeit different slices of it) make things slower.
The reason is that each thread loads a portion of the array SCATT into cache, and whenever another thread writes into that portion of SCATT, it invalidates the data previously stored in cache. Although the input data has not actually been changed, since there is no race condition (the other thread updated a different slice of SCATT), the processor gets a signal that the cache line is invalid and thus reloads the data (see any discussion of false sharing for details). This causes high data transfer overhead.
The solution is to make each slice private to a given thread. In your case it is even simpler, as you do not require read access to SCATT at all. Just replace
SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
with
SCATT0 = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT0+1.0
SCATT(IN1,IN2) = SCATT0
where SCATT0 is a private variable.
And why does this not happen in the first snippet? It certainly does; however, I suspect that the compiler might have optimized the problem away. When it calculated DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2)), it very likely stored the value in a register and used it instead of SCATT(IN1,IN2) in UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0.
Besides, if you want to speed the code up you should make the loops more efficient. The first rule of parallelization is: don't do it! Optimize the serial code first. So replace snippet 1 with the following (you could even throw in a workshare construct around the last line):
DATA = DATA/(NN(1)*NN(2))
!$OMP PARALLEL DO private(temp)
DO IN2=1,NN(2)
temp = (IN2-1)*NN(1)
SCATT(:,IN2) = DATA(temp+1:temp+NN(1))
ENDDO
!$OMP END PARALLEL DO
UADI = SCATT+1.0
You can do something similar with snippet 2 as well.