Nested loops parallelized with OpenMP run slowly
Problem description
I've got a part of a Fortran program consisting of some nested loops which I want to parallelize with OpenMP.
integer :: nstates, N, i, dima, dimb, dimc, a_row, b_row, b_col, c_row, row, col
double complex, dimension(4,4) :: mat
double complex, dimension(:), allocatable :: vecin, vecout

nstates = 2
N = 24
allocate(vecin(nstates**N), vecout(nstates**N))
vecin = ...some data
vecout = 0
mat = reshape([...some data...], [4,4])
dimb = nstates**2

!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col)
do i = 1, N-1
   dima = nstates**(i-1)
   dimc = nstates**(N-i-1)
   do a_row = 1, dima
      do b_row = 1, dimb
         do c_row = 1, dimc
            row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
            do b_col = 1, dimb
               col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
               !$OMP ATOMIC
               vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
            end do
         end do
      end do
   end do
end do
!$OMP END PARALLEL DO
The program runs and the result I get is also correct; it's just incredibly slow, much slower than without OpenMP. I don't know much about OpenMP. Have I done something wrong with the use of PRIVATE or OMP ATOMIC? I would be grateful for any advice on how to improve the performance of my code.
The OMP ATOMIC on the innermost statement is what kills the performance: every single update of vecout has to be synchronized between the threads, so the loop spends more time coordinating than computing, and the parallel version ends up slower than the serial one. If your arrays are too large and you get stack overflows with an automatic reduction, you can instead implement the reduction yourself with allocatable temporary arrays: each thread accumulates into its own copy, and the copies are combined once at the end. As Francois Jacq pointed out, dima and dimc also have to be private, otherwise they cause a race condition.
double complex, dimension(:), allocatable :: tmp

!$OMP PARALLEL PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col,tmp)
! Every thread gets its own heap-allocated accumulator, so the inner
! loops need no synchronization at all.
allocate(tmp(size(vecout)))
tmp = 0
!$OMP DO
do i = 1, N-1
   dima = nstates**(i-1)
   dimc = nstates**(N-i-1)
   do a_row = 1, dima
      do b_row = 1, dimb
         do c_row = 1, dimc
            row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
            do b_col = 1, dimb
               col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
               tmp(row) = tmp(row) + vecin(col)*mat(b_row,b_col)
            end do
         end do
      end do
   end do
end do
!$OMP END DO
! Combine the per-thread partial sums. CRITICAL serializes this step,
! but it runs only once per thread, not once per loop iteration.
!$OMP CRITICAL
vecout = vecout + tmp
!$OMP END CRITICAL
deallocate(tmp)
!$OMP END PARALLEL
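For completeness: the "automatic reduction" mentioned above is OpenMP's REDUCTION clause. When a per-thread copy of vecout fits in memory, the whole transformation collapses to a one-directive change. A minimal sketch, assuming a compiler that supports Fortran array reductions on allocatable arrays (OpenMP 3.0 or later, e.g. gfortran with -fopenmp):

!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col) REDUCTION(+:vecout)
do i = 1, N-1
   dima = nstates**(i-1)
   dimc = nstates**(N-i-1)
   do a_row = 1, dima
      do b_row = 1, dimb
         do c_row = 1, dimc
            row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
            do b_col = 1, dimb
               col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
               ! No ATOMIC needed: each thread updates its own private
               ! copy of vecout; OpenMP sums the copies at the end.
               vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
            end do
         end do
      end do
   end do
end do
!$OMP END PARALLEL DO

With vecout at nstates**N = 2**24 double complex elements, each private copy is roughly 256 MB, and those copies may land on the stack; that is exactly the stack-overflow scenario where the explicit allocatable tmp version above is the safer choice.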