行主列vs列主矩阵乘法 [英] Row major vs Column Major Matrix Multiplication

查看:121
本文介绍了行主列vs列主矩阵乘法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在尝试计算Matrix乘法的C程序.我通过遍历第二个矩阵的每一列来完成此任务,如下所示.

I am currently working on a C program trying to compute Matrix Multiplication.. I have approached this task by looping through each column of the second matrix as seen below.

我将尺寸设置为1000.

I have set size to 1000.

for(i=0;i<size;i++)
{
  for(j=0;j<size;j++)
  {
    for(k=0;k<size;k++)
    {
      matC[i][j]+=matA[i][k]*matB[k][j];
    }
  }
}

我想知道此实现中有问题的访问模式.什么使行/列访问比其他访问更有效?我正在尝试从使用缓存的逻辑角度来理解这一点.请帮助我理解这一点.非常感谢您的帮助:)

I wanted to know what problematic access pattern is in this implementation.. What makes row/column access more efficient than the other? I am trying to understand this in terms of logic from the use of Caches.. Please help me understand this. Your help is much appreciated :)

推荐答案

如果您正在谈论缓存的使用,那么您可能想做一些称为循环切片的事情.您将循环拆分为多个图块,以使循环的内部存储在缓存中(如今这是很大的).因此,您的循环将变成类似(如果您要使用指针将矩阵传递给函数)

If you are talking about use of Caches then you might want to do something called loop tiling. You break the loop into tiles such that inner part of the loop gets stored inside cache (which is quite large these days). So your loop will turn into something like (if you are passing the matrices into a function using pointers )

for(j=0;j<size;j+=t)
    for(k=0;k<size;k+=t)
       for(i=0;i<size;i+=t)
          for(ii=i;ii<MIN(i+t,size);ii++)
             for(jj=j;jj<MIN(j+t,size);jj++)
        {       
          var=*(c+ii * size+jj);    //Var is a scalar variable                     
          for(kk=k;kk<MIN(k+t,size);kk++)
              {                      
         var = var + *(a+ii *size +kk) * *(bt +jj * size+ kk);          
              }
          *(c+ii *size +jj) = var;
        }                       

t的值根据您获得的加速而变化.它可以t = 64,128,256,依此类推.您还可以在这里使用许多其他技术.循环平铺只是有效利用缓存的一种技术.此外,可以在发送到乘法功能之前对B矩阵进行转置.这样,您将获得矩阵B元素的线性访问. 考虑

The value of t varies depending on the speedup that you get. It can t = 64,128,256 and so on. There are many other techniques that you can use here. Loop tiling is just once technique to utilize the cache efficiently.Further, you can transpose the B matrix before you send to the multiplication function. That way you will get a linear access of elements of matrix B. To explain you more Consider

          A  --------   and B | | | |
             --------         | | | |
             --------         | | | | 
             --------         | | | |

这里,您将始终考虑将A的第一行与B的第一列相乘.并且由于您使用的是C,因此我相信CPU需要付出额外的努力才能读取矩阵B的所有列:内存中的一个.为了简化这些工作,您可以转置矩阵并获取矩阵B'的行(本质上只是B的列),然后使用循环切片来缓存最大数量的元素以进行乘法运算.希望这会有所帮助.

Here, you will always consider, to multiply the first row of A with first column of B.And since you are using C I believe, CPU requires extra efforts to read in the all the columns of matrix B one by one inside the memory. To ease up these efforts, you can transpose the matrix and get the rows of matrix B' (which are nothing but columns of B essentially) and use loop tiling to cache the maximum amount of elements for multiplication.Hope this helps.

这篇关于行主列vs列主矩阵乘法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆