不同优化下的 Fortran 矩阵乘法性能 [英] Fortran matrix multiplication performance in different optimization

查看:26
本文介绍了不同优化下的 Fortran 矩阵乘法性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在看《Scientific Software Development with Fortran》一书,里面有一个练习我觉得很有意思:

I'm reading the book "Scientific Software Development with Fortran", and there is an exercise in it I think very interesting:

创建一个名为 MatrixMultiplyModule 的 Fortran 模块.向其中添加三个子例程,称为 LoopMatrixMultiply、IntrinsicMatrixMultiply 和 MixMatrixMultiply.每个例程都应将两个实矩阵作为参数,执行矩阵乘法,并通过第三个参数返回结果.LoopMatrixMultiply 应该完全用 do 循环编写,没有数组操作或内部过程; IntrinsicMatrixMultiply 应该使用 matmul 内部函数编写; MixMatrixMultiply 应该使用一些 do 循环和内部函数 dot_product 编写.编写一个小程序来测试它们的性能针对不同大小的矩阵执行矩阵乘法的三种不同方法."

"Create a Fortran module called MatrixMultiplyModule. Add three subroutines to it called LoopMatrixMultiply, IntrinsicMatrixMultiply, and MixMatrixMultiply. Each routine should take two real matrices as argument, perform a matrix multiplication, and return the result via a third argument. LoopMatrixMultiply should be written entirely with do loops, and no array operations or intrinsic procedures; IntrinsicMatrixMultiply should be written utilizing the matmul intrinsic function; and MixMatrixMultiply should be written using some do loops and the intrinsic function dot_product. Write a small program to test the performance of these three different ways of performing the matrix multiplication for different sizes of matrices."

我对两个 2 阶矩阵的乘法进行了一些测试,以下是不同优化标志下的结果:

I did some test of multiply of two rank 2 matrix and here are the results, under different optimization flags:

compiler:ifort version 13.0.0 on Mac 

这是我的问题:

为什么在 -O0 下它们的性能大致相同,但使用 -O3 时 matmul 的性能提升很大,而显式循环和点积的性能提升较小?另外,为什么 dot_product 似乎与显式 do 循环具有相同的性能?

Why under -O0 they have about the same performance but matmul has huge performance boost when using -O3, while explicit loop and dot product has less performance boost? Also, why dot_product seems have the same performance compare to explicit do loops?

我使用的代码如下:

module MatrixMultiplyModule

contains
    subroutine LoopMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))
        mtx3=0.

        do i=1,m
            do j=1,n
                do k=1,size(mtx1,dim=2)
                    mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
                end do      
            end do
        end do
    end subroutine

    subroutine IntrinsicMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))
        mtx3=matmul(mtx1,mtx2)

    end subroutine

    subroutine MixMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array size not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))

        do i=1,m
            do j=1,n
                    mtx3(i,j)=dot_product(mtx1(i,:),mtx2(:,j))
            end do
        end do

    end subroutine

end module




program main
use MatrixMultiplyModule
implicit none

real,allocatable        ::  a(:,:),b(:,:)
real,allocatable    ::  c1(:,:),c2(:,:),c3(:,:)
integer ::  n
integer ::  count, rate
real    ::  timeAtStart, timeAtEnd
real    ::  time(3,10)
do n=100,1000,100
    allocate(a(n,n),b(n,n))

    call random_number(a)
    call random_number(b)

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call LoopMatrixMultiply(a,b,c1)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(1,n/100)=timeAtEnd-timeAtStart

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call IntrinsicMatrixMultiply(a,b,c2)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(2,n/100)=timeAtEnd-timeAtStart

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call MixMatrixMultiply(a,b,c3)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(3,n/100)=timeAtEnd-timeAtStart


    deallocate(a,b)

end do

open(1,file="time.txt")
do n=1,10
    write(1,*) time(:,n)
end do
close(1)
deallocate(c1,c2,c3)
end program

推荐答案

在遍历数组元素时需要注意以下几点:

There are several things one should be aware of when looping over array elements:

  • 确保内循环在内存中的连续元素上.在您当前的循环"算法中,内部循环超过索引 k.由于矩阵在内存中作为列布局(第一个索引在通过内存时变化最快),访问新的 k 值可能需要将新页面加载到缓存中.在这种情况下,您可以通过将循环重新排序来优化您的算法:

  • Make sure the inner loop is over consecutive elements in memory. In your current 'loop' algorithm, the inner loop is over index k. Since matrices are laid out in memory as columns (first index varying most rapidly when going through the memory), accessing a new value of k might need to load a new page into cache. In this case, you could optimize your algorithm by reordering the loops as:

do j=1,n
    do k=1,size(mtx1,dim=2)
        do i=1,m
            mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
        end do      
    end do
end do

现在,内部循环在内存中的连续元素上(mtx2(k,j) 值可能只会在编译器内部循环之前获取一次,如果没有,您可以存储它在循环之前的一个临时变量中)

now, the inner loop is over consecutive elements in memory (the mtx2(k,j) value will be probably be fetched only once before the inner loop by the compiler, if not you can store it in a temporary variable before the loop)

确保整个循环都可以放入缓存中,以避免过多的缓存未命中.这可以通过阻塞算法来完成.在这种情况下,解决方案可能是例如:

Make sure the entire loops can fit into the cache in order to avoid too much cache misses. This can be done by blocking the algorithm. In this case, a solution could be e.g.:

l=size(mtx1,dim=2)
ichunk=512 ! I have a 3MB cache size (real*4)
do jj=1,n,ichunk
    do kk=1,l,ichunk

do j=jj,min(jj+ichunk-1,n)
    do k=kk,min(kk+ichunk-1,l)
        do i=1,m
            mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
        end do      
    end do
end do

    end do
end do

在这种情况下,性能将取决于 ichunk 的大小,尤其是对于足够大的矩阵(您甚至可以阻塞内部循环,这只是一个示例).

in which case performance will depend in the size of ichunk, especially for large enough matrices (you could even block the inner loop, this is just an example).

确保执行循环所需的工作远小于循环内部的工作.这可以通过循环展开"来解决,即在循环的一次迭代中组合多个语句.通常编译器可以通过提供标志 -funroll-loops 来做到这一点.

Make sure the work needed to perform the loop is much smaller than the work inside the loop. This can be solved by 'loop unrolling', i.e. combining several statements in one iteration of the loop. Usually the compiler can do this by supplying the flag -funroll-loops.

如果我使用上面的代码并使用标志 -O3 -funroll-loops 进行编译,我得到的性能比使用 matmul 稍好.

If I use the above code and compile with the flags -O3 -funroll-loops, I get a slightly better performance than with matmul.

这三点中要记住的重要一点是关于循环排序的第一点,因为这会影响其他用例中的性能,而编译器通常无法修复它.循环展开,您可以留给编译器(但对其进行测试,因为这并不总能提高性能).至于第二点,由于这取决于硬件,因此您不应该(通常)尝试自己实现一个非常有效的矩阵乘法,而是考虑使用诸如 e.g. 之类的库.atlas,它可以优化缓存大小,或者 MKL 或 ACML 等供应商库.

The important thing to remember of those three is the first point about loop ordering, since this is something that will affect performance in other use cases, and the compiler cannot usually fix that. The loop unrolling, you can leave to the compiler (but test it, as this does not always increase performance). As for the second point, since this is dependent on the hardware, you shouldn't (generally) try to implement a very efficient matrix multiplication yourself and instead consider using a library such as e.g. atlas, which can optimize for cache size, or a vendor library such as MKL or ACML.

这篇关于不同优化下的 Fortran 矩阵乘法性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆