块矩阵矩阵乘法的最佳块大小值 [英] Best block size value for block matrix matrix multiplication

查看:104
本文介绍了块矩阵矩阵乘法的最佳块大小值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想用下面的C代码进行块矩阵矩阵乘法.在这种方法中,将大小为BLOCK_SIZE的块加载到最快的缓存中,以减少计算过程中的内存流量.

I want to do block matrix-matrix multiplication with the following C code.In this approach, blocks of size BLOCK_SIZE is loaded into the fastest cache in order to reduce memory traffic during calculation.

void bMMikj(double **A , double **B , double ** C , int m, int n , int p , int BLOCK_SIZE){

   int i, j , jj, k , kk ;
   register double jjTempMin = 0.0 , kkTempMin = 0.0;

   for (jj=0; jj<n; jj+= BLOCK_SIZE) {
       jjTempMin = min(jj+ BLOCK_SIZE,n); 
       for (kk=0; kk<n; kk+= BLOCK_SIZE) {
           kkTempMin = min(kk+ BLOCK_SIZE,n); 
           for (i=0; i<n; i++) {
               for (k = kk ; k < kkTempMin ; k++) {
                   for (j=jj; j < jjTempMin; j++) {
                      C[i][j]  +=  A[i][k] * B[k][j];
                   }
               }
           }
      }
   }
}

我搜索了 BLOCK_SIZE 的最合适值,然后发现 BLOCK_SIZE< = sqrt(M_fast/3) M_fast 是一级缓存.

I searched about the best suitable value of BLOCK_SIZE and I found that BLOCK_SIZE <= sqrt( M_fast / 3 ) and M_fast here is L1 cache.

在我的计算机上,我有两个L1缓存,如下面,我使用启发式方法,例如从 4 BLOCK_SIZE 开始,然后将 8 的值增加 1000 次,具有不同的矩阵大小值.

In my computer, I have two L1 cache as shown here with lstopo tool. Below, I am using heuristics like starting with a BLOCK_SIZE of 4 and increasing the value by 8 for 1000 times, with different values of matrix sizes.

希望获得最佳MFLOPS(或最少的乘法时间)值和相应的 BLOCK_SIZE 值是最合适的值.

Hopping to get the best MFLOPS ( or the least time for multiplication ) value and the corresponding BLOCK_SIZE value will be the best suitable value.

这是测试代码:

int BLOCK_SIZE = 4;
int m , n , p;
m = n = p = 1024; /* This value is also changed
                     and all the matrices are square, for simplicity
                     */
for(int i=0;i< 1000; i++ , BLOCK_SIZE += 8) {
    # aClock.start();
    test_bMMikj(A , B ,  C , loc_n , loc_n , loc_n ,BLOCK_SIZE);
    # aClock.stop();
}

测试为我提供了每个矩阵大小的不同值,并且与公式不一致.计算机模型为具有3.8GiB的英特尔®酷睿™i5-3320M CPU @ 2.60GHz×4",这是英特尔规范

Testing gives me different values for each matrix size and do not agree with the formula.The computer model is 'Intel® Core™ i5-3320M CPU @ 2.60GHz × 4 'with 3.8GiB and here is the Intel specification

另一个问题:
如果我有两个L1缓存,例如此处中的缓存,我应该考虑<代码> BLOCK_SIZE 相对于它们之一还是两者之和?

Another question :
If I have two L1 caches, like what I have in here, should I consider BLOCK_SIZE with respect to one of them or the summation of both?

推荐答案

1.块矩阵乘法:这个想法是通过重用当前存储在缓存中的数据块来最大程度地利用时间和空间局部性.您的相同代码是错误的,因为它仅包含5个循环;for块应该是6,例如:

1. Block Matrix Multiplication: The idea is to make maximum use of both temporal and spatial locality by reusing the data block currently stored in cache. Your code for the same is incorrect as it contains only 5 loops; for block there should be 6, something like :

for(int ii=0; ii<N; ii+=stride)
{
    for(int jj=0; jj<N; jj+=stride)
    {
        for(int kk=0; kk<N; kk+=stride)
        {
            for(int i=ii; i<ii+stride; ++i)
            {
                for(int j=jj; j<jj+stride; ++j)
                {
                    for(int k=kk; k<kk+stride; ++k) C[i][j] += A[i][k]*B[k][j];
                }
            }               
        }
    }
}

为简单起见,最初将N和跨度都保持为2的幂.ijk模式不是最佳选择,您应该选择kij或ikj,有关该此处.不同的访问模式具有不同的性能,您应该尝试ijk的所有排列.

Initially keep both N and stride as powers of 2 for simplicity. The ijk pattern is not the most optimal, you should either go for kij or ikj, details on that here. Different access patterns have different performances, you should try all permutations of ijk.

2.块/步幅大小:一般说来,在矩阵乘法的情况下,最快的缓存(L1)应该能够容纳3个数据块(步长*步长)以获得最佳性能,但是尝试并亲自找到它总是好的.将步幅增加8可能不是一个好主意,请尝试将步幅保持为2的幂,因为大多数块大小都是以这种方式确定大小的.而且,您只应查看数据缓存(L1d),在您的情况下为32KB.

2. Block/Stride size: It is generally said that your fastest cache (L1) should be able to accomodate 3 blocks(stride*stride) of data for optimal performance in case of matrix multiplication, but it's always good to experiment and find it for yourself. Increasing stride by 8 may not be a good idea, try to keep it as increasing powers of 2 because most block sizes are sized in that manner. And you should only look at data cache(L1d) which in your case is 32KB.

这篇关于块矩阵矩阵乘法的最佳块大小值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆