Matrix multiplication: Small difference in matrix size, large difference in timings


Problem description


I have a matrix multiply code that looks like this:

for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

Here, the size of the matrix is represented by dimension. Now, if the size of the matrices is 2000, it takes 147 seconds to run this piece of code, whereas if the size of the matrices is 2048, it takes 447 seconds. So while the difference in no. of multiplications is (2048*2048*2048)/(2000*2000*2000) = 1.073, the difference in the timings is 447/147 = 3. Can someone explain why this happens? I expected it to scale linearly, which does not happen. I am not trying to make the fastest matrix multiply code, simply trying to understand why it happens.

Specs: AMD Opteron dual core node (2.2GHz), 2G RAM, gcc v 4.5.0

Program compiled as gcc -O3 simple.c

I have run this on Intel's icc compiler as well, and seen similar results.

EDIT:

As suggested in the comments/answers, I ran the code with dimension=2060 and it takes 145 seconds.

Here's the complete program:

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

/* change dimension size as needed */
const int dimension = 2048;
struct timeval tv; 

double timestamp()
{
        double t;
        gettimeofday(&tv, NULL);
        t = tv.tv_sec + (tv.tv_usec/1000000.0);
        return t;
}

int main(int argc, char *argv[])
{
        int i, j, k;
        double *A, *B, *C, start, end;

        A = (double*)malloc(dimension*dimension*sizeof(double));
        B = (double*)malloc(dimension*dimension*sizeof(double));
        C = (double*)malloc(dimension*dimension*sizeof(double));

        srand(292);

        for(i = 0; i < dimension; i++)
                for(j = 0; j < dimension; j++)
                {   
                        A[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                        B[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                        C[dimension*i+j] = 0.0;
                }   

        start = timestamp();
        for(i = 0; i < dimension; i++)
                for(j = 0; j < dimension; j++)
                        for(k = 0; k < dimension; k++)
                                C[dimension*i+j] += A[dimension*i+k] *
                                        B[dimension*k+j];

        end = timestamp();
        printf("\nsecs:%f\n", end-start);

        free(A);
        free(B);
        free(C);

        return 0;
}

Solution

Here's my wild guess: cache

It could be that you can fit 2 rows of 2000 doubles into the cache. That is slightly less than the 32 KB L1 cache (while leaving room for other necessary things).

But when you bump it up to 2048, it uses the entire cache (and you spill some because you need room for other things).

Assuming the cache policy is LRU, spilling the cache just a tiny bit will cause the entire row to be repeatedly flushed and reloaded into the L1 cache.

The other possibility is cache associativity due to the power-of-two. Though I think that processor is 2-way L1 associative so I don't think it matters in this case. (but I'll throw the idea out there anyway)

Possible Explanation 2: Conflict cache misses due to super-alignment on the L2 cache.

Your B array is being iterated on the column. So the access is strided. Your total data size is 2k x 2k which is about 32 MB per matrix. That's much larger than your L2 cache.

When the data is not aligned perfectly, you will have decent spatial locality on B. Although you are hopping rows and only using one element per cacheline, the cacheline stays in the L2 cache to be reused by the next iteration of the middle loop.

However, when the data is aligned perfectly (2048), these hops will all land in the same cache set and will far exceed your L2 cache associativity. Therefore, the accessed cache lines of B will not stay in cache for the next iteration. Instead, they will need to be pulled in all the way from RAM.
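If this diagnosis is right, padding the leading dimension should recover the fast case without changing the matrix size. The sketch below is my illustration, not code from the post: it takes an explicit leading dimension `ld`, so the caller can allocate rows a few doubles wider than `n` and break the super-alignment of B's rows.

```c
/* Same triple loop as the question, but with an explicit leading
 * dimension (ld >= n). Choosing a non-power-of-two ld (e.g. n + 8)
 * keeps successive rows of B out of the same cache set. */
static void matmul(int n, int ld, const double *A, const double *B,
                   double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[ld*i + j] += A[ld*i + k] * B[ld*k + j];
}
```

Allocating with `ld = n + 8` (an arbitrary small pad) costs a few extra megabytes, but if conflict misses are the culprit it should bring the n = 2048 timing close to the n = 2000 one.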
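Another way to sidestep the problem entirely (again my addition, not from the answer) is to reorder the loops so B is read row-wise. With i-k-j ordering every inner-loop access is unit-stride, so the pathological power-of-two column walk through B never happens, regardless of alignment:

```c
/* i-k-j ordering: hoist A[i][k] out of the inner loop and stream
 * across rows of B and C. All inner-loop accesses are unit-stride. */
static void matmul_ikj(int n, const double *A, const double *B,
                       double *C)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[n*i + k];
            for (int j = 0; j < n; j++)
                C[n*i + j] += a * B[n*k + j];
        }
}
```

This computes the same result as the original loop nest (addition order per element changes, so floating-point results may differ in the last bits).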
