Optimizing a seven-deep nested do loop

Question

I have three arrays, and I need to compute a summation over them.

The code that implements it is:

do i=1,320
  do j=1,320
    do k=1,10
      do l=1,10
        do m=1,10
          sum=0   ! reset the accumulator for each element of A
          do r=1,10
            do s=1,10
              sum=sum+B(k,l,r,s,m)*P(i,j,r,s,m)
            end do
          end do
          A(i,j,k,l,m)=sum
        end do
      end do
    end do
  end do
end do

It takes 1 day to execute the code. Is there a way to optimize it?

Thanks.

Recommended answer

The trick in these things is to look for common patterns and use existing efficient routines to speed them up.

M.S.B is, as usual, completely right that just flipping your indices will give you a substantial speedup, although Intel's Fortran compiler at a high optimization level will already give you some of that benefit.
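M.S.B's answer is not reproduced here, so the following is only a sketch of the kind of index flipping presumably meant. Fortran stores arrays in column-major order, so the first index should vary fastest in the innermost loop. The array sizes, the fill pattern, and the names `Aref` and `acc` are assumptions made for illustration; the reordered nest is checked against the original loop order.

```fortran
program loop_order_sketch
  implicit none
  ! Small illustrative sizes (assumption); the question uses 320 x 320 and 10.
  integer, parameter :: ni = 8, nj = 8, n = 4
  integer :: B(n,n,n,n,n), P(ni,nj,n,n,n)
  integer :: A(ni,nj,n,n,n), Aref(ni,nj,n,n,n)
  integer :: i, j, k, l, m, r, s, acc

  ! Arbitrary non-uniform test data (assumption), so index mix-ups would show up.
  B = reshape([(mod(i, 7), i = 1, size(B))], shape(B))
  P = reshape([(mod(i, 5), i = 1, size(P))], shape(P))

  ! Reference: the question's loop order, with i outermost.
  do i = 1, ni
    do j = 1, nj
      do k = 1, n
        do l = 1, n
          do m = 1, n
            acc = 0
            do r = 1, n
              do s = 1, n
                acc = acc + B(k,l,r,s,m)*P(i,j,r,s,m)
              end do
            end do
            Aref(i,j,k,l,m) = acc
          end do
        end do
      end do
    end do
  end do

  ! Flipped order: last index outermost, first index innermost, so that
  ! A and P are both walked contiguously in memory by the inner loop over i.
  A = 0
  do m = 1, n
    do l = 1, n
      do k = 1, n
        do s = 1, n
          do r = 1, n
            do j = 1, nj
              do i = 1, ni
                A(i,j,k,l,m) = A(i,j,k,l,m) + B(k,l,r,s,m)*P(i,j,r,s,m)
              end do
            end do
          end do
        end do
      end do
    end do
  end do

  print *, 'flipped order matches original: ', all(A == Aref)
end program loop_order_sketch
```

How much this gains depends on the compiler and optimization level, since an optimizer may already interchange some of these loops on its own.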

But let's peel off the m index for a second (which is easy to do because, as M.S.B has pointed out, it is the slowest-moving index) and just look at the multiplication:

$$A_{i,j,k,l} = \sum_{r,s} B_{k,l,r,s} \, P_{i,j,r,s}$$
$$A_{i,j,k,l} = \sum_{r,s} P_{i,j,r,s} \, B_{k,l,r,s}$$

Reshaping the arrays:

$$A_{ij,kl} = \sum_{rs} P_{ij,rs} \, B_{kl,rs}$$
$$A_{ij,kl} = \sum_{rs} P_{ij,rs} \, B^{T}_{rs,kl}$$
$$A = P \times B^{T}$$
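To make the combined indices concrete, here is one way to write the bookkeeping, assuming 1-based column-major flattening (which is what Fortran's `reshape` does). The extents n_i, n_j, n_k, n_l, n_r, n_s are notation introduced here for illustration, not symbols from the original answer.

```latex
\begin{aligned}
ij &= i + n_i\,(j-1), \qquad kl = k + n_k\,(l-1), \qquad rs = r + n_r\,(s-1), \\
A_{ij,kl} &= \sum_{rs=1}^{n_r n_s} P_{ij,rs}\, B^{T}_{rs,kl},
\qquad B^{T}_{rs,kl} = B_{kl,rs}.
\end{aligned}
```

With these definitions P is an (n_i n_j) × (n_r n_s) matrix, B^T is (n_r n_s) × (n_k n_l), and A is (n_i n_j) × (n_k n_l). For extents of 32 and 10, that is a 1024 × 100 matrix times a 100 × 100 matrix for each value of m.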

where we now have a matrix multiplication, for which very efficient routines exist. So if we reshape the P and B arrays into matrices and transpose B, we can do a simple matrix multiplication and reshape the result; the reshapes won't even necessarily require any copies in this case. So changing something like this:

program testpsum
implicit none

integer, dimension(10,10,10,10,10) :: B
integer, dimension(32,32,10,10,10) :: P
integer, dimension(32,32,10,10,10) :: A
integer :: psum
integer :: i, j, k, l, m, r, s

B = 1
P = 2

do i=1,32
  do j=1,32
    do k=1,10
      do l=1,10
        do m=1,10
          psum = 0   ! reset the accumulator before each sum over r and s
          do r=1,10
            do s=1,10
              psum=psum+B(k,l,r,s,m)*P(i,j,r,s,m)
            end do
          end do
          A(i,j,k,l,m)=psum
        end do
      end do
    end do
  end do
end do

print *,minval(A), maxval(A)

end program testpsum

to this:

program testmatmult
implicit none

integer, dimension(10,10,10,10,10) :: B
integer, dimension(32,32,10,10,10) :: P
integer, dimension(10*10,10*10) :: Bmt
integer, dimension(32*32,10*10) :: Pm
integer, dimension(32,32,10,10,10) :: A
integer :: m

B = 1
P = 2

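do m=1,10
    ! flatten (i,j) into rows and (r,s) into columns: a 1024 x 100 matrix
    Pm  = reshape(P(:,:,:,:,m),[32*32,10*10])
    ! flatten B to (kl,rs), then transpose so that rs becomes the row index
    Bmt = transpose(reshape(B(:,:,:,:,m),[10*10,10*10]))
    ! A = P x B^T for this m, reshaped back to four indices
    A(:,:,:,:,m) = reshape(matmul(Pm,Bmt),[32,32,10,10])
end do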

print *,minval(A), maxval(A)

end program testmatmult

Gives timings of:

$ time ./psum
         200         200

real    0m2.239s
user    0m1.197s
sys     0m0.008s

$ time ./matmult
         200         200

real    0m0.064s
user    0m0.027s
sys     0m0.008s

when compiled with ifort -O3 -xhost -mkl so we can use the fast Intel MKL libraries. It gets even faster when you don't create that Pm temporary and just do the reshape in the matmul call, and faster still (for large matrices) if you use -mkl=parallel for threaded routines. If you don't have MKL, you can just link to some other fast BLAS _GEMM routine.
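As a rough sketch of doing the reshape inside the matmul call, the program below drops the named Pm and Bmt temporaries and checks the result against plain nested loops. The small sizes, the non-uniform fill pattern, and the names `Aref` and `acc` are assumptions made for illustration. Filling B and P with constants, as the test programs do, would not catch a wrong transpose, which is why varied data is used here.

```fortran
program matmul_inline_sketch
  implicit none
  ! Small illustrative sizes (assumption); the test programs use 32 and 10.
  integer, parameter :: ni = 8, nj = 8, n = 6
  integer :: B(n,n,n,n,n), P(ni,nj,n,n,n)
  integer :: A(ni,nj,n,n,n), Aref(ni,nj,n,n,n)
  integer :: i, j, k, l, m, r, s, acc

  ! Arbitrary non-uniform test data (assumption).
  B = reshape([(mod(i, 7), i = 1, size(B))], shape(B))
  P = reshape([(mod(i, 5), i = 1, size(P))], shape(P))

  ! Matrix-multiplication form with the reshapes written inline:
  ! (ij x rs) times (rs x kl) gives (ij x kl) for each m.
  do m = 1, n
    A(:,:,:,:,m) = reshape( &
        matmul(reshape(P(:,:,:,:,m), [ni*nj, n*n]), &
               transpose(reshape(B(:,:,:,:,m), [n*n, n*n]))), &
        [ni, nj, n, n])
  end do

  ! Reference: direct summation over r and s.
  do m = 1, n
    do l = 1, n
      do k = 1, n
        do j = 1, nj
          do i = 1, ni
            acc = 0
            do s = 1, n
              do r = 1, n
                acc = acc + B(k,l,r,s,m)*P(i,j,r,s,m)
              end do
            end do
            Aref(i,j,k,l,m) = acc
          end do
        end do
      end do
    end do
  end do

  print *, 'matmul form matches loops: ', all(A == Aref)
end program matmul_inline_sketch
```

Whether the compiler still creates hidden temporaries for the inline reshapes depends on the compiler and its flags, so the timing benefit should be measured rather than assumed.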
