Matrix multiplication time complexity in MATLAB


Question

Does anyone know which algorithm MATLAB uses for matrix multiplication and what is its time complexity?

Solution

For completeness -- as mentioned in this thread, Matlab uses the DGEMM (Double-precision GEneral Matrix-Matrix multiplication) routine from BLAS (Basic Linear Algebra Subprograms).

Note that there is not one single implementation of BLAS - it is tuned for particular processor architectures. Therefore you cannot be absolutely certain which algorithm is being used on your machine without finding out which version of BLAS is in use.
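
If you want to check which BLAS your own installation is using, recent MATLAB releases can report the linked libraries directly; a minimal sketch, assuming a release that supports the '-blas' and '-lapack' options of the version function:

% Report the BLAS and LAPACK libraries this MATLAB session is linked
% against. On many platforms this is Intel MKL, but it varies by
% architecture and MATLAB release.
blasInfo = version('-blas')
lapackInfo = version('-lapack')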

The BLAS specification defines the inputs and outputs of each subroutine, and gives acceptable error bounds for each subroutine's output. Implementations are free to use whatever algorithm they like, as long as they follow the specification.

The reference implementation of BLAS uses a block matrix multiplication algorithm in DGEMM that has time complexity O(n^3) for multiplying two n x n matrices. I think it's reasonable to assume that most implementations of BLAS will more or less follow the reference implementation.
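
You can get a rough empirical feel for the O(n^3) scaling by timing products of increasing size. A sketch using timeit (the sizes are arbitrary; doubling n should increase the runtime by roughly a factor of 8, though multithreading and cache effects will blur the picture):

% Time n-by-n products for doubling n; under O(n^3) scaling, each
% step should take roughly 8x as long as the previous one.
for n = [512 1024 2048]
    A = rand(n);
    B = rand(n);
    t = timeit(@() A * B);   % robust timing over several runs
    fprintf('n = %4d: %8.4f s\n', n, t);
end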

Note that it doesn't use the naive matrix multiplication algorithm:

% Naive triple-loop multiplication of two N-by-N matrices A and B.
C = zeros(N);    % the accumulator must be initialized before the loops
for i = 1:N
    for j = 1:N
        for k = 1:N
            C(i,j) = C(i,j) + A(i,k) * B(k,j);
        end
    end
end

This is because, typically, the entire matrix will not fit in local memory. If data is constantly being shifted into and out of local memory, the algorithm will slow down. The block matrix algorithm breaks the operation into small blocks, such that each block is small enough to fit into local memory, reducing the number of shifts into and out of memory.
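
To make the idea concrete, here is an illustrative blocked version of the same computation. The block size bs is a hypothetical tuning parameter (real BLAS implementations choose it per architecture), and the sketch assumes N is a multiple of bs:

bs = 64;                                % hypothetical block size
C = zeros(N);
for ii = 1:bs:N
    for jj = 1:bs:N
        for kk = 1:bs:N
            I = ii:ii+bs-1; J = jj:jj+bs-1; K = kk:kk+bs-1;
            % Each bs-by-bs tile stays in fast memory and is reused
            % many times before the next tile is loaded.
            C(I,J) = C(I,J) + A(I,K) * B(K,J);
        end
    end
end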

There exist asymptotically faster matrix multiplication algorithms, e.g. the Strassen algorithm (roughly O(n^2.807)) or the Coppersmith-Winograd algorithm (roughly O(n^2.376)), which grow more slowly than O(n^3). However, they are generally not cache-aware and ignore locality -- meaning that data continually needs to be shunted around in memory -- so for most modern architectures the overall algorithm is actually slower than an optimized block matrix multiplication algorithm.
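
For reference, the Strassen recursion looks like this in MATLAB -- an illustrative sketch (saved as strassen.m) that assumes n is a power of two and uses an arbitrary cutoff below which it falls back to the built-in product; note how the seven sub-products jump around in memory, which is exactly the locality problem described above:

function C = strassen(A, B)
% Textbook Strassen multiplication for n-by-n matrices, n a power of two.
n = size(A, 1);
if n <= 64                      % arbitrary cutoff: use the ordinary product
    C = A * B;
    return
end
h = n / 2;
i1 = 1:h; i2 = h+1:n;
A11 = A(i1,i1); A12 = A(i1,i2); A21 = A(i2,i1); A22 = A(i2,i2);
B11 = B(i1,i1); B12 = B(i1,i2); B21 = B(i2,i1); B22 = B(i2,i2);
M1 = strassen(A11 + A22, B11 + B22);    % seven recursive sub-products
M2 = strassen(A21 + A22, B11);
M3 = strassen(A11, B12 - B22);
M4 = strassen(A22, B21 - B11);
M5 = strassen(A11 + A12, B22);
M6 = strassen(A21 - A11, B11 + B12);
M7 = strassen(A12 - A22, B21 + B22);
C = [M1 + M4 - M5 + M7,  M3 + M5;
     M2 + M4,            M1 - M2 + M3 + M6];
end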

Wikipedia notes that the Strassen algorithm may provide speedups on a single-core CPU for matrix sizes greater than several thousand; however, the speedup is likely to be around 10% or so, and the developers of BLAS probably don't consider it worthwhile for this rare case (that said, this paper from 1996 claims a speed increase of around 10% over DGEMM for n above about 200 -- though I don't know how out of date that is). The Coppersmith-Winograd algorithm, on the other hand, "only provides an advantage for matrices so large that they cannot be processed by modern hardware".

So the answer is that Matlab uses an asymptotically naive (O(n^3)) but efficient and cache-aware algorithm to get its blazing fast matrix multiplication.


Update: I have created some videos that demonstrate the locality of the block matrix multiplication algorithm compared to the naive algorithm.

In each of the following videos, we visualize the multiplication of two 8x8 matrices A and B to create the product C = A x B. The yellow highlight indicates which element in each of the matrices A, B and C is being processed at each step of the algorithm. You can see how the block matrix multiplication works on only a small block of each matrix at a time, and reuses each of those blocks multiple times, so that the number of times data must be shifted in and out of local memory is minimized.
