Is row-major ordering more efficient for matrix-vector multiplication?

Question

If M is an n x m matrix and v and u are vectors, then in terms of indices, matrix-vector multiplication looks like u[i] = sum(M[i,j] v_j, 1 <= j <= m). Since v is a vector, its elements are presumably stored in consecutive memory locations for numerical-computation-oriented languages. If M is stored in row-major order (as in C, Mathematica, and Pascal), then the subsequent M[i,j] in the sum are also stored in consecutive memory locations as j is incremented, making the iteration very efficient. If it's stored in column-major order (as in Fortran, Matlab, R, and Julia), then incrementing j requires moving over by a number of memory locations equal to the outer matrix stride, which in this case equals n. This naively seems less efficient for matrices with many rows. (For matrix-matrix multiplication the problem doesn't come up, because under either ordering convention, incrementing the summed index requires moving by the major stride in one matrix's memory or the other.)
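Concretely, for a zero-based n x m array the two storage conventions give the linear offsets

     row-major:     offset(i,j) = i*m + j      (incrementing j moves by 1 element)
     column-major:  offset(i,j) = i + j*n      (incrementing j moves by n elements)

which is exactly the stride-n jump described above.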

Is the difference between moving over in memory by one unit and by many units appreciable or negligible in most computer architectures, compared to the multiplication and addition operations? (I'm guessing "negligible", since in practice Fortran is typically at least as fast as C, but can anyone elaborate why?)

Answer

The difference is expected to be significant on most computer architectures, at least in principle.

Matrix-vector multiplication is a memory-bound computation, because memory reuse is low. All N components of v are reused to compute each element of u, but each of the N^2 elements of the matrix is used only once. If we compare the latency of typical main memory, on the order of 100 ns (see e.g. https://gist.github.com/hellerbarde/2843375), with the time required to perform a floating-point operation, less than 1 ns, we see that most of the time is spent loading and storing values from/to the arrays.
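As a rough back-of-the-envelope check (assuming 4-byte reals and the n = 10000 used in the test program below): the product needs about 2*n^2 = 2*10^8 floating-point operations, but it also has to read the 4*n^2 bytes = 400 MB of matrix data exactly once, i.e. only about half a flop per byte of memory traffic. A matrix of that size cannot stay in any CPU cache, so it streams from main memory and the arithmetic units spend most of their time waiting for data.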

We can still implement it in a cache-friendly way, i.e. with as much data locality as possible. Since memory is loaded into the cache in lines, we have to use each loaded cache line as much as possible. That is why accessing contiguous memory regions reduces the time spent loading data from memory.

To support this, let us try a very simple program:

program mv
  implicit none
  integer, parameter :: n = 10000
  real, allocatable :: M(:,:), v(:), u(:)
  real :: start, finish
  integer :: i, j

  allocate(M(n,n), v(n), u(n))
  call random_number(M)
  call random_number(v)
  u(:) = 0.

  call cpu_time(start)
  do i = 1, n
     do j = 1, n
        ! non-contiguous order: j is the second index, so consecutive
        ! iterations jump by n reals in memory (column-major storage)
        u(i) = u(i) + M(i,j)*v(j)
        ! contiguous order: j is the first (fast) index, unit stride
        ! u(i) = u(i) + M(j,i)*v(j)
     enddo
  enddo
  call cpu_time(finish)

  print *, 'elapsed time: ', finish-start
end program mv

Some results (times in seconds):

               non-contiguous order   contiguous order
gfortran -O0            1.                 0.5
gfortran -O3           0.3                 0.1
ifort -O0              1.5                0.85
ifort -O3            0.037               0.035

As you can see, the difference is significant when compiling without optimizations. With optimization enabled, gfortran still shows a significant difference, whereas with ifort there is only a small one. Looking at the compiler report, it seems that the compiler interchanged the loops, thus leading to contiguous access in the inner loop.

However, can we say that a language with row-major ordering is more efficient for matrix-vector computation? No, I cannot say that. Not only because the compiler can compensate for the difference. The code itself does not know everything about the rows and columns of M: it basically knows that M has two indices, one of which -- depending on the language -- is contiguous in memory. For matrix-vector multiplication, the best arrangement for data locality is to have the "fast" (contiguous) index run along a row of the algebraic matrix, i.e. over the summed index j. You can achieve this in both "row-major" and "column-major" languages; you just have to store the values of M accordingly. For example, if you have the "algebraic" matrix

     [ M11 M12 ]
M =  [         ]
     [ M21 M22 ]

you store it as the "computational" matrix

C       ==> M[1,1] = M11 ; M[1,2] = M12 ; M[2,1] = M21 ; M[2,2] = M22 
Fortran ==> M[1,1] = M11 ; M[2,1] = M12 ; M[1,2] = M21 ; M[2,2] = M22 

so that you are always contiguous along a row of the "algebraic" matrix. The computer does not know anything about the initial matrix, but we know that the computational matrix (in the column-major case) is the transposed version of the algebraic one. In both cases I will have the inner loop iterating over a contiguous index, and the final result will be the same vector.
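To make this concrete in Fortran, here is a minimal sketch (the program name mv_transposed, the small size and the random data are just for illustration) in which the array Mt stores the transpose, so the first, contiguous index runs along the algebraic rows:

program mv_transposed
  implicit none
  integer, parameter :: n = 4
  real :: Mt(n,n), v(n), u(n)   ! by convention Mt(j,i) holds the algebraic element M(i,j)
  integer :: i, j

  call random_number(Mt)        ! stands in for filling in the transposed values
  call random_number(v)
  u(:) = 0.

  do i = 1, n
     do j = 1, n
        ! j is the first (fast) index of Mt: unit-stride access, while
        ! algebraically this still accumulates u(i) = sum_j M(i,j)*v(j)
        u(i) = u(i) + Mt(j,i)*v(j)
     enddo
  enddo
  print *, u
end program mv_transposed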

In a complex code, if I have already allocated the matrix and filled it with values, and I cannot decide to store the transposed matrix, it is possible that a "row-major" language gives the best performance. But interchanging the loops (see https://en.wikipedia.org/wiki/Loop_interchange), as done automatically by the Intel compiler and as done by BLAS implementations (see http://www.netlib.org/lapack/explore-html/db/d58/sgemv_8f_source.html), reduces the difference to a very small one. Therefore, using Fortran you may prefer:

do j=1,n
    do i=1,n
        u(i)=u(i)+M(i,j)*v(j)
    enddo
enddo
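In practice you would rarely write this loop yourself: the SGEMV routine referenced above already takes care of the access pattern. A minimal sketch of calling it (assuming a BLAS library such as the reference BLAS or OpenBLAS is linked, e.g. with -lblas; the program name and the problem size are just for illustration):

program mv_blas
  implicit none
  integer, parameter :: n = 1000
  real, allocatable :: M(:,:), v(:), u(:)
  external :: sgemv             ! single-precision matrix-vector product from BLAS

  allocate(M(n,n), v(n), u(n))
  call random_number(M)
  call random_number(v)
  ! u := 1.0*M*v + 0.0*u ; 'N' means "do not transpose M", n is the leading dimension
  call sgemv('N', n, n, 1.0, M, n, v, 1, 0.0, u, 1)
  print *, u(1)
end program mv_blas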
