Optimization of a seven-level do loop nest
Problem description
I have three arrays and I have to compute this summation.
The implementation code is:
do i=1,320
  do j=1,320
    do k=1,10
      do l=1,10
        do m=1,10
          do r=1,10
            do s=1,10
              sum=sum+B(k,l,r,s,m)*P(i,j,r,s,m)
            end do
          end do
          A(i,j,k,l,m)=sum
        end do
      end do
    end do
  end do
end do
It takes 1 day to execute the code. Is there a way to optimize it?
Thanks.
Recommended answer
The trick in these things is to look for common patterns and use existing efficient routines to speed them up.
M.S.B is, as usual, completely right that just flipping your indices will give you a substantial speedup, although Intel's Fortran compiler at a high optimization level will already give you some of that benefit.
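As a rough illustration of what "flipping the indices" can mean, here is a minimal sketch, not necessarily the exact code M.S.B proposed. It assumes the goal is to make the first array index (i) vary in the innermost loop, since Fortran stores arrays in column-major order. The sizes n and q are hypothetical small values chosen so the check runs quickly, and the fill values are arbitrary.

```fortran
program testreorder
  implicit none
  ! Hypothetical small sizes (the question uses 320 and 10).
  integer, parameter :: n = 8, q = 4
  integer, dimension(q,q,q,q,q) :: B
  integer, dimension(n,n,q,q,q) :: P, A, Aref
  integer :: i, j, k, l, m, r, s, psum

  ! Arbitrary but non-uniform test data.
  B = reshape([(mod(i,7) - 3, i = 1, size(B))], shape(B))
  P = reshape([(mod(i,5) + 1, i = 1, size(P))], shape(P))

  ! Reference: the loop order from the question, with psum reset
  ! before each (r,s) summation.
  do i = 1, n
    do j = 1, n
      do k = 1, q
        do l = 1, q
          do m = 1, q
            psum = 0
            do r = 1, q
              do s = 1, q
                psum = psum + B(k,l,r,s,m)*P(i,j,r,s,m)
              end do
            end do
            Aref(i,j,k,l,m) = psum
          end do
        end do
      end do
    end do
  end do

  ! Reordered: m outermost, i innermost, so A and P are walked
  ! through memory with unit stride in the innermost loop.
  A = 0
  do m = 1, q
    do l = 1, q
      do k = 1, q
        do s = 1, q
          do r = 1, q
            do j = 1, n
              do i = 1, n
                A(i,j,k,l,m) = A(i,j,k,l,m) + B(k,l,r,s,m)*P(i,j,r,s,m)
              end do
            end do
          end do
        end do
      end do
    end do
  end do

  if (any(A /= Aref)) error stop "reordered loops disagree"
  print *, "reordered loops match the original ordering"
end program testreorder
```

How much this helps will depend on the compiler and optimization flags, since an optimizing compiler may already interchange some of these loops on its own.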
But let's peel off the m index for a moment (which is easy to do since, as M.S.B has pointed out, it is the slowest-moving index) and just look at the multiplication:
$$A_{i,j,k,l} = \sum_{r,s} B_{k,l,r,s} \times P_{i,j,r,s}$$
$$A_{i,j,k,l} = \sum_{r,s} P_{i,j,r,s} \times B_{k,l,r,s}$$
Reshaping the arrays:
$$A_{ij,kl} = \sum_{rs} P_{ij,rs} \times B_{kl,rs}$$
$$A_{ij,kl} = \sum_{rs} P_{ij,rs} \times B^{T}_{rs,kl}$$
$$A = P \times B^{T}$$
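To make the composite indices concrete, the mapping can be written out explicitly. The extents $N_i$, $N_k$, $N_r$, $N_s$ are notation introduced here for illustration (they are not in the original answer), and the formulas assume Fortran's column-major storage with 1-based indices:

$$ij = i + N_i\,(j-1), \qquad kl = k + N_k\,(l-1), \qquad rs = r + N_r\,(s-1)$$

$$A_{ij,kl} = \sum_{rs=1}^{N_r N_s} P_{ij,rs}\,\left(B^{T}\right)_{rs,kl}$$

Each composite index is exactly the position the element pair already occupies within its two merged dimensions in memory, which is why reshaping a four-dimensional slice into a two-dimensional matrix does not need to reorder any data.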
where we now have matrix multiplication, for which very efficient routines exist. So if we reshape the P and B matrices, and transpose B, we can do a simple matrix multiplication and reshape the result; and this reshape won't even necessarily require any copies in this case. So changing something like this:
program testpsum
  implicit none
  integer, dimension(10,10,10,10,10) :: B
  integer, dimension(32,32,10,10,10) :: P
  integer, dimension(32,32,10,10,10) :: A
  integer :: psum
  integer :: i, j, k, l, m, r, s

  B = 1
  P = 2
  do i=1,32
    do j=1,32
      do k=1,10
        do l=1,10
          do m=1,10
            ! reset the accumulator before each (r,s) summation
            psum = 0
            do r=1,10
              do s=1,10
                psum=psum+B(k,l,r,s,m)*P(i,j,r,s,m)
              end do
            end do
            A(i,j,k,l,m)=psum
          end do
        end do
      end do
    end do
  end do
  print *,minval(A), maxval(A)
end program testpsum
to this:
program testmatmult
  implicit none
  integer, dimension(10,10,10,10,10) :: B
  integer, dimension(32,32,10,10,10) :: P
  integer, dimension(10*10,10*10) :: Bmt
  integer, dimension(32*32,10*10) :: Pm
  integer, dimension(32,32,10,10,10) :: A
  integer :: m

  B = 1
  P = 2
  do m=1,10
    ! Pm(ij,rs): the m-th slice of P as a 1024 x 100 matrix
    Pm = reshape(P(:,:,:,:,m),[32*32,10*10])
    ! Bmt(rs,kl): the m-th slice of B as a 100 x 100 matrix, transposed
    Bmt = transpose(reshape(B(:,:,:,:,m),[10*10,10*10]))
    ! A(ij,kl) = Pm x Bmt, reshaped back to four dimensions
    A(:,:,:,:,m) = reshape(matmul(Pm,Bmt),[32,32,10,10])
  end do
  print *,minval(A), maxval(A)
end program testmatmult
Gives timings of:
$ time ./psum
200 200
real 0m2.239s
user 0m1.197s
sys 0m0.008s
$ time ./matmult
200 200
real 0m0.064s
user 0m0.027s
sys 0m0.008s
when compiled with ifort -O3 -xhost -mkl, so that we can use the fast Intel MKL libraries.
It gets even faster when you don't create the Pm temporary and just do the reshape inside the matmul call, and faster still (for large matrices) if you use -mkl=parallel for the threaded routines. If you don't have MKL, you can link to the _GEMM routine of some other fast BLAS/LAPACK library instead.
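The variant without the Pm temporary could look something like the following sketch. It is an illustration under stated assumptions rather than the author's exact code: the sizes n and q are hypothetical small values, the fill values are arbitrary, and the result is checked against the plain loop version so the program verifies itself.

```fortran
program testinline
  implicit none
  ! Hypothetical small sizes so the comparison runs quickly.
  integer, parameter :: n = 8, q = 4
  integer, dimension(q,q,q,q,q) :: B
  integer, dimension(n,n,q,q,q) :: P, A, Aref
  integer :: i, j, k, l, m, r, s, psum

  ! Arbitrary but non-uniform test data.
  B = reshape([(mod(i,7) - 3, i = 1, size(B))], shape(B))
  P = reshape([(mod(i,5) + 1, i = 1, size(P))], shape(P))

  ! Matrix-multiply version with the reshapes done inline,
  ! so no named Pm or Bmt temporaries are declared.
  do m = 1, q
    A(:,:,:,:,m) = reshape( &
        matmul(reshape(P(:,:,:,:,m), [n*n, q*q]), &
               transpose(reshape(B(:,:,:,:,m), [q*q, q*q]))), &
        [n, n, q, q])
  end do

  ! Reference result from the explicit loops.
  do m = 1, q
    do l = 1, q
      do k = 1, q
        do j = 1, n
          do i = 1, n
            psum = 0
            do s = 1, q
              do r = 1, q
                psum = psum + B(k,l,r,s,m)*P(i,j,r,s,m)
              end do
            end do
            Aref(i,j,k,l,m) = psum
          end do
        end do
      end do
    end do
  end do

  if (any(A /= Aref)) error stop "matmul version disagrees with loops"
  print *, "matmul version matches the loop version"
end program testinline
```

Whether the inline form actually avoids copies depends on the compiler, which may still create temporaries internally. For real-valued arrays, a BLAS call such as DGEMM with its second operand flagged as transposed ('T') would also make the explicit transpose unnecessary.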