openMP not improving runtime


Problem description



I inherited a piece of Fortran code and am tasked with parallelizing it for the 8-core machine we have. I have two versions of the code, and I am trying to use openMP compiler directives to speed it up. It works on one piece of code, but not the other, and I cannot figure out why: they're almost identical! I ran each piece of code with and without the openMP tags, and the first one showed speed improvements, but not the second one. I hope I am explaining this clearly...

Code sample 1: (significant improvement)

    !$OMP PARALLEL DO
    DO IN2=1,NN(2)
        DO IN1=1,NN(1)
            SCATT(IN1,IN2) = DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2))
            UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
        ENDDO
    ENDDO
    !$OMP END PARALLEL DO

Code sample 2: (no improvement)

    !$OMP PARALLEL DO
    DO IN2=1,NN(2)
        DO IN1=1,NN(1)
            SCATREL = DATA(2*((IN2-1)*NN(1)+IN1)-1)/(NN(1)*NN(2))
            SCATIMG = DATA(2*((IN2-1)*NN(1)+IN1))/(NN(1)*NN(2))
            SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
            UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
        ENDDO
    ENDDO        
    !$OMP END PARALLEL DO

I thought it might be an issue with memory overhead and such, and have tried various combinations of putting variables in shared() and private() clauses, but they either cause segmentation faults or make it even slower.

I also thought it might be that I'm not doing enough work in the loop to see an improvement, but since there's improvement in the smaller loop that doesn't make sense to me.

Can anyone shed some light onto what I can do to see a real speed boost in the second one?

Data on speed boost for code sample 1:

Average runtime (for the whole code not just this snippet)

Without openMP tags: 2m 21.321s 

With openMP tags: 2m 20.640s

Average runtime (profile for just this snippet)

Without openMP tags: 6.3s

With openMP tags: 4.75s

Data on speed boost for code sample 2:

Average runtime (for the whole code not just this snippet)

Without openMP tags: 4m 46.659s

With openMP tags: 4m 49.200s

Average runtime (profile for just this snippet)

Without openMP tags: 15.14s

With openMP tags: 46.63s

Solution

The observation that the code runs slower in parallel than in serial tells me that the culprit is very likely false sharing.

The SCATT array is shared, and each thread accesses a slice of it for both reading and writing. There is no race condition in your code; however, the threads writing to the same array (albeit different slices of it) make things slower.

The reason is that each thread loads a portion of the array SCATT into its cache, and whenever another thread writes to its own portion of SCATT, this invalidates the data previously stored in the first thread's cache. Although the data has not actually changed (there is no race condition; the other thread updated a different slice of SCATT), the processor is signalled that the cache line is invalid and therefore reloads it. This causes high data-transfer overhead.

The solution is to make each slice private to a given thread. In your case it is even simpler as you do not require reading access to SCATT at all. Just replace

    SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
    UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0

with

    SCATT0 = DCOMPLX(SCATREL, SCATIMG)
    UADI(IN1,IN2) = SCATT0+1.0
    SCATT(IN1,IN2) = SCATT0

where SCATT0 is a private variable.

And why does this not happen in the first snippet? It certainly does; however, I suspect that the compiler may have optimized the problem away. When it calculated DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2)), it very likely stored the result in a register and used that value, instead of SCATT(IN1,IN2), in UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0.

Besides, if you want to speed the code up you should make the loops more efficient. The first rule of parallelization is: don't do it! Optimize the serial code first. So replace snippet 1 with (you could even throw in a workshare construct around the last line):

    DATA = DATA/(NN(1)*NN(2))
    !$OMP PARALLEL DO private(temp)
    DO IN2=1,NN(2)
        temp = (IN2-1)*NN(1)
        SCATT(:,IN2) = DATA(temp+1:temp+NN(1))
    ENDDO
    !$OMP END PARALLEL DO
    UADI = SCATT+1.0

You can do something similar with snippet 2 as well.
