Why is boost's matrix multiplication slower than mine?
Question
I have implemented a matrix multiplication with boost::numeric::ublas::matrix
(see my full, working boost code):
Result result = read ();
boost::numeric::ublas::matrix<int> C;
C = boost::numeric::ublas::prod(result.A, result.B);
and another one with the standard algorithm (see full standard code):
vector< vector<int> > ijkalgorithm(vector< vector<int> > A,
vector< vector<int> > B) {
int n = A.size();
// initialise C with 0s
vector<int> tmp(n, 0);
vector< vector<int> > C(n, tmp);
for (int i = 0; i < n; i++) {
for (int k = 0; k < n; k++) {
for (int j = 0; j < n; j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
return C;
}
This is how I test the speed:
time boostImplementation.out > boostResult.txt
diff boostResult.txt correctResult.txt
time simpleImplementation.out > simpleResult.txt
diff simpleResult.txt correctResult.txt
Both programs read a hard-coded text file which contains two 2000 x 2000 matrices. Both programs were compiled with these flags:
g++ -std=c++98 -Wall -O3 -g $(PROBLEM).cpp -o $(PROBLEM).out -pedantic
I got 15 seconds for my implementation and over 4 minutes for the boost implementation!
Edit: After compiling it with
g++ -std=c++98 -Wall -pedantic -O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG library-boost.cpp -o library-boost.out
I got 28.19 seconds for the ikj-algorithm and 60.99 seconds for Boost. So Boost is still considerably slower.
Why is Boost so much slower than my implementation?
Recommended answer
The slower performance of the uBLAS version can be partly explained by its debugging features, as was pointed out by TJD.
Here's the time taken by the uBLAS version with debugging on:
real 0m19.966s
user 0m19.809s
sys 0m0.112s
Here's the time taken by the uBLAS version with debugging off (-DNDEBUG -DBOOST_UBLAS_NDEBUG compiler flags added):
real 0m7.061s
user 0m6.936s
sys 0m0.096s
So with debugging off, the uBLAS version is almost 3 times faster.
The remaining performance difference can be explained by quoting the following section of the uBLAS FAQ, "Why is uBLAS so much slower than (atlas-)BLAS":
An important design goal of uBLAS is to be as general as possible.

This generality almost always comes with a cost. In particular the prod function template can handle different types of matrices, such as sparse or triangular ones. Fortunately uBLAS provides alternatives optimized for dense matrix multiplication, in particular axpy_prod and block_prod. Here are the results of comparing the different methods:
ijkalgorithm prod axpy_prod block_prod
1.335 7.061 1.330 1.278
As you can see, both axpy_prod and block_prod are somewhat faster than your implementation. Measuring just the computation time without I/O, removing unnecessary copying, and careful choice of the block size for block_prod (I used 64) can make the difference more profound.
See also the uBLAS FAQ and Effective uBLAS and general code optimization (http://image.diku.dk/shark/sphinx_pages/build/html/rest_sources/tutorials/for_developers/effective_ublas.html).