Fortran矩阵乘法性能在不同的优化中 [英] Fortran matrix multiplication performance in different optimization

查看:187
本文介绍了Fortran矩阵乘法性能在不同的优化中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在阅读用Fortran进行科学软件开发一书,其中有一个练习,我认为它很有趣: 创建一个名为Fortran的模块MatrixMultiplyModule为它添加三个子例程,分别叫做LoopMatrixMultiply,IntrinsicMatrixMultiply和MixMatrixMultiply,每个例程应该以两个实矩阵为参数,执行一个矩阵乘法,并通过第三个参数返回结果LoopMatrixMultiply应该完全用do循环写,而不是数组操作或内部过程;应使用matmul内部函数来编写IntrinsicMatrixMultiply;并且应使用一些do循环和内部函数dot_product来编写MixMatrixMultiply。编写一个小程序来测试执行矩阵乘法的三种不同方式的性能不同大小的矩阵。



我做了两个二阶矩阵乘法的测试,这里是结果,在不同的优化下离子标志:






 编译器:Mac上的ifort版本13.0.0 b $ b  

这是我的问题:



-O0它们具有相同的性能,但在使用-O3时matmul具有巨大的性能提升,而显式循环和点积的性能提升较少?此外,为什么dot_product看起来与显式的do循环相比具有相同的性能?



我使用的代码如下:

 模块MatrixMultiplyModule 

包含
子例程LoopMatrixMultiply(mtx1,mtx2,mtx3)
real,intent(in):: mtx1( :,:),mtx2(:, :)
real,intent(out),allocatable :: mtx3(:,:)
integer :: m,n
integer :: i, j
if(size(mtx1,dim = 2)/ = size(mtx2,dim = 1))stopinput array size not match
m = size(mtx1,dim = 1)
n =大小(mtx2,dim = 2)
分配(mtx3(m,n))
mtx3 = 0。

do i = 1,m
do j = 1,n
do k = 1,size(mtx1,dim = 2)
mtx3(i,j )= mtx3(i,j)+ mtx1(i,k)* mtx2(k,j)
end do
end do
end do
end subroutine

子程序IntrinsicMatrixMultiply(mtx1,mtx2,mtx3)
实数,意图(in):: mtx1(:,:),mtx2(:, :)
实数,意图(输出),可分配:: mtx3(:, :)
integer :: m,n
integer :: i,j $ b $ if(size(mtx1,dim = 2)/ = size(mtx2,dim = 1))stopinput array size not match
m = size(mtx1,dim = 1)
n = size(mtx2,dim = 2)
allocate(mtx3(m,n))
mtx3 = matmul(mtx1,mtx2)

结束子程序

子程序MixMatrixMultiply(mtx1,mtx2,mtx3)
real,intent(in): :mtx1(:,:),mtx2(:, :)
real,意图(out),allocatable :: mtx3( :,:)
整数:: m,n
整数:: i,j
if(size(mtx1,dim = 2)/ = size(mtx2,dim = 1))stop 输入数组大小不匹配
m = size(mtx1,dim = 1)
n = size(mtx2,dim = 2)
分配(mtx3(m,n))

do i = 1,m
do j = 1,n
mtx3(i,j)= dot_product(mtx1(i,:),mtx2(:,j))
end do
end do

结束子程序

结束模块




程序main
使用MatrixMultiplyModule
隐式无

实数,allocatable :: a(:,:),b(:, :)
实数,allocatable :: c1( :,:),c2(:,:),c3(:, :)
integer :: n
integer :: count,rate
real :: timeAtStart,timeAtEnd
real :: time(3,10)
do n = 100,1000,100
allocate(a(n,n),b(n,n))

call random_number(a)
调用random_number(b)

调用system_clock(count = count,c ount_rate = rate)
timeAtStart = count / real(rate)
调用LoopMatrixMultiply(a,b,c1)
调用system_clock(count = count,count_rate = rate)
timeAtEnd = count / real(rate)
time(1,n / 100)= timeAtEnd-timeAtStart

call system_clock(count = count,count_rate = rate)
timeAtStart = count / real (rate)
调用IntrinsicMatrixMultiply(a,b,c2)
调用system_clock(count = count,count_rate = rate)
timeAtEnd = count / real(rate)
time(2 ,n / 100)= timeAtEnd-timeAtStart

调用system_clock(count = count,count_rate = rate)
timeAtStart = count / real(rate)
调用MixMatrixMultiply(a,b ,c3)
call system_clock(count = count,count_rate = rate)
timeAtEnd = count / real(rate)
time(3,n / 100)= timeAtEnd-timeAtStart


deallocate(a,b)

end do

open(1,file =time.txt)
do n = 1,10
write(1,*)time(:,n)
end do
close (1)
deallocate(c1,c2,c3)
结束程序


解决方案

在循环数组元素时,应该注意以下几点:


  • 确保内部循环遍历内存中的连续元素。在当前的'循环'算法中,内部循环在索引k之上。由于矩阵在内存中按列排列(第一个索引在访问内存时变化最快),因此访问新的k值可能需要将新页面加载到缓存中。在这种情况下,您可以通过重新排序循环来优化算法:

      do j = 1,n 
    do k = 1,size(mtx1,dim = 2)
    do i = 1,m
    mtx3(i,j)= mtx3(i,j)+ mtx1(i,k)* mtx2 ,j)
    end do
    end do
    end do

    现在,内部循环在内存中的连续元素上(如果不是,编译器在内部循环之前, mtx2(k,j)值可能只会被提取一次您可以在循环之前将它存储在一个临时变量中)确保整个循环可以放入缓存中,以避免缓存遗漏太多。这可以通过阻止算法来完成。在这种情况下,解决方案可以是:例如:

      l = size(mtx1,dim = 2)
    ichunk = 512!我有一个3MB的缓存大小(实数* 4)
    do jj = 1,n,ichunk
    do kk = 1,l,ichunk

    do j = jj,min (i,j)= mtx3(i,j),其中k是第k个单元的数目, i,j)+ mtx1(i,k)* mtx2(k,j)
    end do
    end do
    end do

    end do
    在这种情况下,性能将取决于 ichunk 的大小, code>,尤其是对于足够大的矩阵(你甚至可以阻止内部循环,这只是一个例子)。
  • 确保所需的工作执行循环比循环内的工作小得多。这可以通过循环展开来解决,即在循环的一次迭代中组合几个语句。通常情况下,编译器可以通过提供标志 -funroll-loops 来完成此操作。


如果我使用上面的代码并使用标记 -O3 -funroll-loops 进行编译,那么性能会比使用 matmul



重要的是要记住这三点是关于循环排序的第一点,因为这会影响其他性能用例,编译器通常无法解决这个问题。循环展开,你可以离开编译器(但是测试它,因为这并不总是提高性能)。至于第二点,由于这取决于硬件,因此您不应(通常)尝试自己实现非常高效的矩阵乘法,而应考虑使用库(例如,可优化缓存大小的Atlas,或MKL或ACML等供应商库。

I'm reading the book "Scientific Software Development with Fortran", and there is an exercise in it I think very interesting:

"Create a Fortran module called MatrixMultiplyModule. Add three subroutines to it called LoopMatrixMultiply, IntrinsicMatrixMultiply, and MixMatrixMultiply. Each routine should take two real matrices as argument, perform a matrix multiplication, and return the result via a third argument. LoopMatrixMultiply should be written entirely with do loops, and no array operations or intrinsic procedures; IntrinsicMatrixMultiply should be written utilizing the matmul intrinsic function; and MixMatrixMultiply should be written using some do loops and the intrinsic function dot_product. Write a small program to test the performance of these three different ways of performing the matrix multiplication for different sizes of matrices."

I did some test of multiply of two rank 2 matrix and here are the results, under different optimization flags:

compiler:ifort version 13.0.0 on Mac 

Here is my question:

Why under -O0 they have about the same performance but matmul has huge performance boost when using -O3, while explicit loop and dot product has less performance boost? Also, why dot_product seems have the same performance compare to explicit do loops?

The code I use is the following:

module MatrixMultiplyModule

contains
    subroutine LoopMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))
        mtx3=0.

        do i=1,m
            do j=1,n
                do k=1,size(mtx1,dim=2)
                    mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
                end do      
            end do
        end do
    end subroutine

    subroutine IntrinsicMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))
        mtx3=matmul(mtx1,mtx2)

    end subroutine

    subroutine MixMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))

        do i=1,m
            do j=1,n
                    mtx3(i,j)=dot_product(mtx1(i,:),mtx2(:,j))
            end do
        end do

    end subroutine

end module




program main
use MatrixMultiplyModule
implicit none

real,allocatable        ::  a(:,:),b(:,:)
real,allocatable    ::  c1(:,:),c2(:,:),c3(:,:)
integer ::  n
integer ::  count, rate
real    ::  timeAtStart, timeAtEnd
real    ::  time(3,10)
do n=100,1000,100
    allocate(a(n,n),b(n,n))

    call random_number(a)
    call random_number(b)

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call LoopMatrixMultiply(a,b,c1)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(1,n/100)=timeAtEnd-timeAtStart

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call IntrinsicMatrixMultiply(a,b,c2)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(2,n/100)=timeAtEnd-timeAtStart

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call MixMatrixMultiply(a,b,c3)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(3,n/100)=timeAtEnd-timeAtStart


    deallocate(a,b)

end do

open(1,file="time.txt")
do n=1,10
    write(1,*) time(:,n)
end do
close(1)
deallocate(c1,c2,c3)
end program

解决方案

There are several things one should be aware of when looping over array elements:

  • Make sure the inner loop is over consecutive elements in memory. In your current 'loop' algorithm, the inner loop is over index k. Since matrices are laid out in memory as columns (first index varying most rapidly when going through the memory), accessing a new value of k might need to load a new page into cache. In this case, you could optimize your algorithm by reordering the loops as:

    do j=1,n
        do k=1,size(mtx1,dim=2)
            do i=1,m
                mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
            end do      
        end do
    end do
    

    now, the inner loop is over consecutive elements in memory (the mtx2(k,j) value will be probably be fetched only once before the inner loop by the compiler, if not you can store it in a temporary variable before the loop)

  • Make sure the entire loops can fit into the cache in order to avoid too much cache misses. This can be done by blocking the algorithm. In this case, a solution could be e.g.:

    l=size(mtx1,dim=2)
    ichunk=512 ! I have a 3MB cache size (real*4)
    do jj=1,n,ichunk
        do kk=1,l,ichunk
    
    do j=jj,min(jj+ichunk-1,n)
        do k=kk,min(kk+ichunk-1,l)
            do i=1,m
                mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
            end do      
        end do
    end do
    
        end do
    end do
    

    in which case performance will depend in the size of ichunk, especially for large enough matrices (you could even block the inner loop, this is just an example).

  • Make sure the work needed to perform the loop is much smaller than the work inside the loop. This can be solved by 'loop unrolling', i.e. combining several statements in one iteration of the loop. Usually the compiler can do this by supplying the flag -funroll-loops.

If I use the above code and compile with the flags -O3 -funroll-loops, I get a slightly better performance than with matmul.

The important thing to remember of those three is the first point about loop ordering, since this is something that will affect performance in other use cases, and the compiler cannot usually fix that. The loop unrolling, you can leave to the compiler (but test it, as this does not always increase performance). As for the second point, since this is dependent on the hardware, you shouldn't (generally) try to implement a very efficient matrix multiplication yourself and instead consider using a library such as e.g. atlas, which can optimize for cache size, or a vendor library such as MKL or ACML.

这篇关于Fortran矩阵乘法性能在不同的优化中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆