Matrix multiplication: Small difference in matrix size, large difference in timings


Question

I have a matrix multiply code that looks like this:

for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

Here, the size of the matrices is represented by dimension. Now, if the size of the matrices is 2000, it takes 147 seconds to run this piece of code, whereas if the size of the matrices is 2048, it takes 447 seconds. So while the ratio of the number of multiplications is (2048*2048*2048)/(2000*2000*2000) = 1.073, the ratio of the timings is 447/147 = 3. Can someone explain why this happens? I expected it to scale roughly linearly with the amount of work, which does not happen. I am not trying to make the fastest matrix multiply code, simply trying to understand why this happens.

Specs: AMD Opteron dual core node (2.2 GHz), 2 GB RAM, gcc v4.5.0

The program is compiled with gcc -O3 simple.c.

I have run this with Intel's icc compiler as well, and seen similar results.

As suggested in the comments/answers, I ran the code with dimension=2060 and it takes 145 seconds.

Here is the complete program:

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

/* change dimension size as needed */
const int dimension = 2048;
struct timeval tv; 

double timestamp()
{
        double t;
        gettimeofday(&tv, NULL);
        t = tv.tv_sec + (tv.tv_usec/1000000.0);
        return t;
}

int main(int argc, char *argv[])
{
        int i, j, k;
        double *A, *B, *C, start, end;

        A = (double*)malloc(dimension*dimension*sizeof(double));
        B = (double*)malloc(dimension*dimension*sizeof(double));
        C = (double*)malloc(dimension*dimension*sizeof(double));

        srand(292);

        for(i = 0; i < dimension; i++)
                for(j = 0; j < dimension; j++)
                {   
                        A[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                        B[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                        C[dimension*i+j] = 0.0;
                }   

        start = timestamp();
        for(i = 0; i < dimension; i++)
                for(j = 0; j < dimension; j++)
                        for(k = 0; k < dimension; k++)
                                C[dimension*i+j] += A[dimension*i+k] *
                                        B[dimension*k+j];

        end = timestamp();
        printf("\nsecs:%f\n", end-start);

        free(A);
        free(B);
        free(C);

        return 0;
}

Answer

Here's my guess: cache.

It could be that you can fit 2 rows of 2000 doubles into the cache, which is slightly less than the 32 KB L1 cache (while leaving room for other necessary things).

But when you bump it up to 2048, it uses the entire cache (and you spill some because you need room for other things).

Assuming the cache policy is LRU, spilling the cache just a tiny bit will cause the entire row to be repeatedly flushed and reloaded into the L1 cache.

The other possibility is cache associativity due to the power of two. Though I think that processor is 2-way L1 associative, so I don't think it matters in this case. (But I'll throw the idea out there anyway.)

Possible explanation 2: Conflict cache misses due to super-alignment on the L2 cache.

Your B array is being iterated down its columns, so the access is strided. Your total data size is 2k x 2k, which is about 32 MB per matrix. That's much larger than your L2 cache.

When the data is not aligned perfectly, you will have decent spatial locality on B. Although you are hopping rows and only using one element per cache line, the cache line stays in the L2 cache to be reused by the next iteration of the middle loop.

However, when the data is aligned perfectly (2048), these hops will all land on the same "cache way" and will far exceed your L2 cache associativity. Therefore, the accessed cache lines of B will not stay in cache for the next iteration. Instead, they will need to be pulled in all the way from RAM.

