How does BLAS get such extreme performance?


Problem Description

Out of curiosity I decided to benchmark my own matrix multiplication function versus the BLAS implementation... I was to say the least surprised at the result:

Custom Implementation, 10 trials of 1000x1000 matrix multiplication:

Took: 15.76542 seconds.

BLAS Implementation, 10 trials of 1000x1000 matrix multiplication:

Took: 1.32432 seconds.

This is using single precision floating point numbers.

My implementation:

#include <cstring>    // memset
#include <stdexcept>  // std::runtime_error

// Naive triple loop, column-major storage:
// C (ADim1 x BDim2) = A (ADim1 x ADim2) * B (BDim1 x BDim2)
template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
{
    if ( ADim2!=BDim1 )
        throw std::runtime_error("Error sizes off");

    memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
    int cc2,cc1,cr1;
    for ( cc2=0 ; cc2<BDim2 ; ++cc2 )           // columns of C (and B)
        for ( cc1=0 ; cc1<ADim2 ; ++cc1 )       // inner dimension
            for ( cr1=0 ; cr1<ADim1 ; ++cr1 )   // rows of C (and A)
                C[cc2*ADim1+cr1] += A[cc1*ADim1+cr1]*B[cc2*BDim1+cc1];  // leading dimension of C is ADim1
}
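
For context, a minimal timing harness along these lines (my own reconstruction; the question does not show the actual driver) reproduces the setup quoted above, 10 trials on 1000x1000 single-precision matrices, assuming the mmult template is in scope:

    #include <chrono>
    #include <iostream>
    #include <vector>

    int main()
    {
        const int N = 1000;
        std::vector<float> A(N*N, 1.0f), B(N*N, 1.0f), C(N*N);

        auto t0 = std::chrono::steady_clock::now();
        for (int trial = 0; trial < 10; ++trial)
            mmult(A.data(), N, N, B.data(), N, N, C.data());
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;

        std::cout << "Took: " << dt.count() << " seconds.\n";
    }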

I have two questions:

  1. Given that matrix-matrix multiplication, say nxm * mxn, requires n*n*m multiplications, so in the case above 1000^3 or 1e9 operations: how is it possible that on my 2.6 GHz processor BLAS does 10*1e9 operations in 1.32 seconds? Even if multiplication were a single operation and nothing else were done, it should take about 4 seconds. (See the back-of-the-envelope calculation after this list.)
  2. Why is my implementation so much slower?
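
(Back-of-the-envelope answer to question 1, assuming a single 2.6 GHz core with SSE: SSE works on 4 single-precision floats at once, and such a core can issue one SSE multiply and one SSE add per cycle, giving a peak of roughly 2.6e9 * 8 = 20.8 GFLOPS. Ten products of 1000x1000 matrices cost 10 * 2e9 = 2e10 floating point operations, which is about 1 second at that peak, so 1.32 seconds is plausible even on a single core.)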

Answer

A good starting point is the great book The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. They provide a free download version.

BLAS is divided into three levels:

  • Level 1 defines a set of linear algebra functions that operate on vectors only. These functions benefit from vectorization (e.g. from using SSE).

  • Level 2 functions are matrix-vector operations, e.g. some matrix-vector product. These functions could be implemented in terms of Level 1 functions. However, you can boost the performance of these functions if you can provide a dedicated implementation that makes use of a multiprocessor architecture with shared memory.

  • Level 3 functions are operations like the matrix-matrix product. Again you could implement them in terms of Level 2 functions. But Level 3 functions perform O(N^3) operations on O(N^2) data. So if your platform has a cache hierarchy then you can boost performance if you provide a dedicated implementation that is cache optimized/cache friendly. This is nicely described in the book. The main boost of Level 3 functions comes from cache optimization. This boost significantly exceeds the second boost from parallelism and other hardware optimizations. (A minimal blocking sketch follows this list.)
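
To make the cache argument concrete, here is a minimal sketch of the blocking idea (my own illustration, not code from the book or from any BLAS library): the product is computed in small tiles so that each tile of A, B and C stays in cache and every loaded element is reused many times.

    #include <algorithm>  // std::min, std::fill

    // Toy cache-blocked product C = A*B for column-major n x n matrices,
    // single precision. BS is a tile size that a real library would tune
    // to the cache sizes of the target machine.
    void mmult_blocked(const float* A, const float* B, float* C, int n)
    {
        const int BS = 64;
        std::fill(C, C + n*n, 0.0f);
        for (int jj = 0; jj < n; jj += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int ii = 0; ii < n; ii += BS)
                    // one tile update: C(ii.., jj..) += A(ii.., kk..) * B(kk.., jj..)
                    for (int j = jj; j < std::min(jj + BS, n); ++j)
                        for (int k = kk; k < std::min(kk + BS, n); ++k) {
                            const float b = B[j*n + k];            // B(k, j)
                            for (int i = ii; i < std::min(ii + BS, n); ++i)
                                C[j*n + i] += A[k*n + i] * b;      // C(i,j) += A(i,k)*B(k,j)
                        }
    }

Even this naive tiling typically gains a large factor over the three plain loops of the question; the real Level 3 kernels add packing, vectorization and register blocking on top of it.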

By the way, most (or even all) of the high performance BLAS implementations are NOT implemented in Fortran. ATLAS is implemented in C. GotoBLAS/OpenBLAS is implemented in C and its performance critical parts in Assembler. Only the reference implementation of BLAS is implemented in Fortran. However, all these BLAS implementations provide a Fortran interface such that it can be linked against LAPACK (LAPACK gains all its performance from BLAS).
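
For reference, the Level 3 routine that does the job benchmarked above is sgemm. Through the CBLAS interface a call replacing the mmult template might look like this (a sketch; header names and link flags depend on which BLAS implementation you install):

    #include <cblas.h>

    // C = 1.0f * A * B + 0.0f * C, all matrices column-major single precision,
    // with the same dimension names as the mmult template above.
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                ADim1, BDim2, ADim2,     // M, N, K
                1.0f, A, ADim1,          // alpha, A, leading dimension of A
                B, BDim1,                // B, leading dimension of B
                0.0f, C, ADim1);         // beta, C, leading dimension of C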

Optimized compilers play a minor role in this respect (and for GotoBLAS/OpenBLAS the compiler does not matter at all).

IMHO no BLAS implementation uses algorithms like the Coppersmith–Winograd algorithm or the Strassen algorithm. The likely reasons are:

  • Maybe it is not possible to provide a cache-optimized implementation of these algorithms (i.e. you would lose more than you would win).
  • These algorithms are numerically not stable. As BLAS is the computational kernel of LAPACK this is a no-go.
  • Although these algorithms have a nice time complexity on paper, the Big-O notation hides a large constant, so they only start to pay off for very large matrices. (A short example follows this list.)
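
(To put a number on the hidden constant: one Strassen step replaces the 8 half-size multiplications of a 2x2 block product by 7, a saving of just 12.5% per recursion level, while adding 18 half-size matrix additions and subtractions and their memory traffic. The asymptotic O(n^2.807) advantage therefore only overtakes a cache-tuned classical O(n^3) kernel for rather large matrices.)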

Edit/Update:

The new and groundbreaking papers for this topic are the BLIS papers. They are exceptionally well written. For my lecture "Software Basics for High Performance Computing" I implemented the matrix-matrix product following their papers. Actually I implemented several variants of the matrix-matrix product. The simplest variant is entirely written in plain C and has less than 450 lines of code. All the other variants merely optimize the loops:

    // Micro-kernel: accumulate an MR x NR block AB (column-major) as a sum of
    // kc rank-1 updates from packed panels of A and B.
    for (l=0; l<MR*NR; ++l) {
        AB[l] = 0;
    }
    for (l=0; l<kc; ++l) {
        for (j=0; j<NR; ++j) {
            for (i=0; i<MR; ++i) {
                AB[i+j*MR] += A[i]*B[j];   // AB(i,j) += A(i,l) * B(l,j)
            }
        }
        A += MR;   // advance to the next column of the packed A panel
        B += NR;   // advance to the next row of the packed B panel
    }

The overall performance of the matrix-matrix product only depends on these loops. About 99.9% of the time is spent here. In the other variants I used intrinsics and assembler code to improve the performance. You can see the tutorial going through all the variants here:

ulmBLAS: Tutorial on GEMM (Matrix-Matrix Product)
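
To give a flavor of what the intrinsics variants do, here is a minimal SSE version of the same rank-1 update (my own illustration, assuming MR = 4, single precision, and 16-byte aligned packed buffers; the real kernels additionally keep AB in registers across the whole kc loop instead of reloading it):

    #include <xmmintrin.h>

    // Vectorized micro-kernel update: processes all MR = 4 rows at once.
    for (long l = 0; l < kc; ++l) {
        __m128 a = _mm_load_ps(A);               // load A[0..3]
        for (int j = 0; j < NR; ++j) {
            __m128 b  = _mm_set1_ps(B[j]);       // broadcast B[j]
            __m128 ab = _mm_load_ps(&AB[j*4]);   // AB(0..3, j)
            ab = _mm_add_ps(ab, _mm_mul_ps(a, b));
            _mm_store_ps(&AB[j*4], ab);
        }
        A += 4;
        B += NR;
    }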

Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can gain such performance, and why it does not matter whether you use row-major or column-major storage!
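
(The storage order does not matter because these implementations first copy, or "pack", the current blocks of A and B into contiguous buffers laid out exactly as the micro-kernel wants them; the kernel never touches the matrices in their original layout, only the cheap packing routines do.)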

The final benchmarks are here (we called our project ulmBLAS):

Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen

Another Edit/Update:

I also wrote a tutorial on how BLAS gets used for numerical linear algebra problems like solving a system of linear equations:

High Performance LU Factorization

(This LU factorization is for example used by Matlab for solving a system of linear equations.)

I hope to find time to extend the tutorial to describe and demonstrate how to realise a highly scalable parallel implementation of the LU factorization like in PLASMA.

Ok, here you go: Coding a Cache Optimized Parallel LU Factorization

P.S.: I also did some experiments on improving the performance of uBLAS. It actually is pretty simple to boost (yeah, play on words :) ) the performance of uBLAS:

Experiments with uBLAS.

And here is a similar project with BLAZE:

Experiments with BLAZE.
