Why is boost's matrix multiplication slower than mine?


Problem description

I have implemented one matrix multiplication with boost::numeric::ublas::matrix (see my full, working boost code):

Result result = read ();

boost::numeric::ublas::matrix<int> C;
C = boost::numeric::ublas::prod(result.A, result.B);
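
(The Result type and read() come from the full boost code linked above, which isn't reproduced here. A minimal, hypothetical sketch of what they might look like, assuming both parsed matrices are bundled in one struct:)

#include <boost/numeric/ublas/matrix.hpp>

// Hypothetical sketch only: the real definitions are in the linked code.
// Result bundles the two matrices parsed from the input text file.
struct Result {
    boost::numeric::ublas::matrix<int> A;
    boost::numeric::ublas::matrix<int> B;
};

Result read();  // reads both matrices from the hard-coded text file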

and another one with the standard algorithm (see the full standard code):

vector< vector<int> > ijkalgorithm(vector< vector<int> > A, 
                                    vector< vector<int> > B) {
    int n = A.size();

    // initialise C with 0s
    vector<int> tmp(n, 0);
    vector< vector<int> > C(n, tmp);

    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            for (int j = 0; j < n; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return C;
}

This is how I test the speed:

time boostImplementation.out > boostResult.txt
diff boostResult.txt correctResult.txt

time simpleImplementation.out > simpleResult.txt
diff simpleResult.txt correctResult.txt

Both programs read a hard-coded text file which contains two 2000 x 2000 matrices. Both programs were compiled with these flags:

g++ -std=c++98 -Wall -O3 -g $(PROBLEM).cpp -o $(PROBLEM).out -pedantic

I got 15 seconds for my implementation and over 4 minutes for the boost implementation!

Edit: After compiling it with

g++ -std=c++98 -Wall -pedantic -O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG library-boost.cpp -o library-boost.out

I got 28.19 seconds for the ikj-algorithm and 60.99 seconds for Boost. So Boost is still considerably slower.

Why is Boost so much slower than my implementation?

Answer

The slower performance of the uBLAS version can be partly explained by its debugging features, as was pointed out by TJD.

Here's the time taken by the uBLAS version with debugging on:

real    0m19.966s
user    0m19.809s
sys     0m0.112s

Here's the time taken by the uBLAS version with debugging off (-DNDEBUG -DBOOST_UBLAS_NDEBUG compiler flags added):

real    0m7.061s
user    0m6.936s
sys     0m0.096s

So with debugging off, the uBLAS version is almost 3 times faster.

The remaining performance difference can be explained by quoting the following section of the uBLAS FAQ, "Why is uBLAS so much slower than (atlas-)BLAS?":

"An important design goal of ublas is to be as general as possible."

This generality almost always comes with a cost. In particular, the prod function template can handle different types of matrices, such as sparse or triangular ones. Fortunately uBLAS provides alternatives optimized for dense matrix multiplication, in particular axpy_prod and block_prod. Here are the results of comparing the different methods (times in seconds):

ijkalgorithm   prod   axpy_prod  block_prod
   1.335       7.061    1.330       1.278

As you can see, both axpy_prod and block_prod are somewhat faster than your implementation. Measuring just the computation time without I/O, removing unnecessary copying, and a careful choice of the block size for block_prod (I used 64) can make the difference more profound.

See also the uBLAS FAQ and "Effective uBLAS and general code optimization" (http://image.diku.dk/shark/sphinx_pages/build/html/rest_sources/tutorials/for_developers/effective_ublas.html).
