How to avoid the 'code optimization' in operations like matrix multiplication?

Problem description

I need to compare the running times of two algorithms, Alg_x and Alg_y. Alg_x contains many matrix multiplications, while Alg_y contains many element-wise operations (e.g., summation and multiplication of pairs of numbers/vectors). Theoretically, Alg_x and Alg_y have the same running time: in the example below, both compute X*X' and perform the same O(d^2*n) number of floating-point operations for a d-by-n matrix X. In practice, however, Alg_x runs much faster than Alg_y, because matrix multiplication has been specially designed and optimized in MATLAB.

My question, then, is: how can I turn off such 'code optimization' so that the running times can be compared fairly and reflect the theoretical time complexity?

X = randn(1000, 2000);   % test matrix shared by Alg_x and Alg_y

Alg_x

tic;
temp = X*X';   % a single call to MATLAB's built-in matrix multiplication
toc

Alg_y

[d, n] = size(X);
temp = zeros(d, d);
tic;
for i = 1:n
    x = X(:,i);
    temp = temp + x*x';   % accumulate column outer products; same result as X*X'
end
toc
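
As a side note on measurement: a single tic/toc run can be noisy. One option is to wrap both variants in function handles and time them with timeit, which runs each several times. This is only a measurement sketch, assuming a MATLAB release that allows a local function at the end of a script; the helper name rank1_accumulate is mine, not part of the original code:

X = randn(1000, 2000);

t_x = timeit(@() X*X');                 % Alg_x: one built-in matrix product
t_y = timeit(@() rank1_accumulate(X));  % Alg_y: explicit loop over columns
fprintf('Alg_x: %.4f s, Alg_y: %.4f s\n', t_x, t_y);

function temp = rank1_accumulate(X)
    % Same result as X*X', built up from n rank-1 updates.
    [d, n] = size(X);
    temp = zeros(d, d);
    for i = 1:n
        x = X(:,i);
        temp = temp + x*x';
    end
end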

The two pieces of code above produce the same output, while Alg_x runs much faster. Moreover, Alg_y also runs much faster after I remove x = X(:,i); temp = temp + x*x';, so I guess it is the for loop that makes Alg_y slow.

I do want to turn off and avoid such optimizations. Below is something I extracted from Why is MATLAB so fast in matrix multiplication?

I am making some benchmarks with CUDA, C++, C#, and Java, and using MATLAB for verification and matrix generation. But when I multiply with MATLAB, 2048x2048 and even bigger matrices are multiplied almost instantly.

             1024x1024   2048x2048   4096x4096
             ---------   ---------   ---------
CUDA C (ms)      43.11      391.05     3407.99
C++ (ms)       6137.10    64369.29   551390.93
C# (ms)       10509.00   300684.00  2527250.00
Java (ms)      9149.90    92562.28   838357.94
MATLAB (ms)      75.01      423.10     3133.90

Only CUDA is competitive, but I thought that at least C++ would be somewhat close and not 60x slower.

So my question is - How is MATLAB doing it that fast?

C++ code:

// Naive triple-nested-loop multiplication: matice3 = matice1 * matice2,
// where all matrices are rozmer x rozmer.
float temp = 0;
timer.start();
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            temp = temp + matice1[j][m] * matice2[m][k];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();

I also don't know what to think about the C# results. The algorithm is the same as in C++ and Java, but there's a giant jump from 1024 to 2048?

Edit2: Updated MATLAB and 4096x4096 results

Recommended answer

I'm answering your question "How is MATLAB doing it that fast?".

MATLAB uses Intel MKL for matrix multiplication.
This is highly optimized code that takes advantage of all the cores and their vector processing units (SSE / AVX).
Moreover, it is hand-tuned for the cache layout of the CPU.
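
If you want to check which BLAS/LAPACK build a given MATLAB installation actually links against, the version command can report it. This is a quick check added here for illustration; on typical desktop installations it reports an Intel MKL build:

% Show the BLAS and LAPACK libraries this MATLAB session uses
version('-blas')
version('-lapack')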

Your code doesn't do that and hence leaves a lot of gains on the table.

There might be a way to disable MKL in MATLAB, though so far I've only seen methods to replace it.
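
One documented knob that levels at least part of the playing field is the thread count: maxNumCompThreads(1) confines MATLAB's computational routines to a single thread, removing the multi-core advantage of the built-in multiplication, although the SIMD and cache-blocking optimizations inside MKL remain. A minimal sketch (a partial measure added here, not a way to disable MKL itself):

% Limit MATLAB to one computational thread for the measurement,
% then restore the previous setting.
nOld = maxNumCompThreads(1);   % returns the previous thread count

X = randn(1000, 2000);
tic;  temp = X*X';  toc        % built-in product, now running on one thread

maxNumCompThreads(nOld);       % restore the original setting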
