cblas gemm time dependent on input matrix values - Ubuntu 14.04
This is an extension of my earlier question, but I am asking it separately because I am getting really frustrated, so please do not down-vote it!
Question: What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?
I know gemv is designed for matrix-vector multiplication, but why can't I use gemm for vector-matrix multiplication if it takes less time, especially for sparse matrices?
A short representative code is given below. It asks the user to enter a value and populates a vector with that value, then replaces every 32nd element with its index. So if we enter '0' we get a sparse vector, but for other values we get a dense vector.
#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>
using namespace std;

int main()
{
    const int m = 5000;
    timespec blas_start, blas_end;
    long totalnsec;    // total nanoseconds
    double totalsec, totaltime;
    int i, j;
    float *A = new float[m];      // 1 x m
    float *B = new float[m * m];  // m x m
    float *C = new float[m];      // 1 x m
    float input;

    cout << "Enter a value to populate the vector (0 for sparse) ";
    cin >> input; // enter 0 for sparse

    // input matrix A: every 32nd element is non-zero, rest of the values = input
    for(i = 0; i < m; i++)
    {
        A[i] = input;
        if(i % 32 == 0)  // adjust for sparsity
            A[i] = i;
    }

    // input matrix B: identity matrix
    for(i = 0; i < m; i++)
        for(j = 0; j < m; j++)
            B[i*m + j] = (i == j);

    clock_gettime(CLOCK_REALTIME, &blas_start);
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
    clock_gettime(CLOCK_REALTIME, &blas_end);

    /* for(i = 0; i < m; i++)
        printf("%f ", C[i]);
    printf("\n\n"); */

    // Print time
    totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
    totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
    if(totalnsec < 0)
    {
        totalnsec += 1e9;
        totalsec -= 1;
    }
    totaltime = totalsec + (double)totalnsec * 1e-9;
    cout << "Duration = " << totaltime << "\n";

    delete[] A; delete[] B; delete[] C; // release the heap buffers
    return 0;
}
I run it as follows on Ubuntu 14.04 with BLAS 3.0:
erisp@ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0291558
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.000959521
EDIT
If I replace the gemm call with the following gemv call then matrix sparsity does not matter
//cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
cblas_sgemv(CblasRowMajor, CblasNoTrans, m, m, 1.0f, B, m, A, 1, 0.0f, C, 1);
Results
erisp@ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0301581
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.0299282
But the issue is that I am trying to optimize someone else's code that uses cublas, and he is successfully and efficiently using gemm to perform this vector-matrix multiplication. So I have to test against it, or categorically prove this call to be incorrect.
EDIT
I have even updated my BLAS library today by running
sudo apt-get install libblas-dev liblapack-dev
EDIT: Executed the following commands as suggested by different contributors
erisp@ubuntu:~/uas/stackoverflow$ ll -d /usr/lib/libblas* /etc/alternatives/libblas.*
lrwxrwxrwx 1 root root 26 مارچ 13 2015 /etc/alternatives/libblas.a -> /usr/lib/libblas/libblas.a
lrwxrwxrwx 1 root root 27 مارچ 13 2015 /etc/alternatives/libblas.so -> /usr/lib/libblas/libblas.so
lrwxrwxrwx 1 root root 29 مارچ 13 2015 /etc/alternatives/libblas.so.3 -> /usr/lib/libblas/libblas.so.3
lrwxrwxrwx 1 root root 29 مارچ 13 2015 /etc/alternatives/libblas.so.3gf -> /usr/lib/libblas/libblas.so.3
drwxr-xr-x 2 root root 4096 مارچ 13 2015 /usr/lib/libblas/
lrwxrwxrwx 1 root root 27 مارچ 13 2015 /usr/lib/libblas.a -> /etc/alternatives/libblas.a
lrwxrwxrwx 1 root root 28 مارچ 13 2015 /usr/lib/libblas.so -> /etc/alternatives/libblas.so
lrwxrwxrwx 1 root root 30 مارچ 13 2015 /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
lrwxrwxrwx 1 root root 32 مارچ 13 2015 /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf
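On Debian/Ubuntu the symlink chain above is managed by the alternatives system, so one way to inspect (or switch) which implementation actually backs libblas.so.3 is shown below; the commands assume the libblas.so.3 alternatives group exists on your system:

```shell
# Show which implementation the libblas.so.3 alternative currently points at
update-alternatives --display libblas.so.3

# Interactively pick a different installed implementation (e.g. OpenBLAS)
sudo update-alternatives --config libblas.so.3

# Confirm which library the binary will actually load at run time
ldd ./gemmcomp.o | grep blas
```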
erisp@ubuntu:~/uas/stackoverflow$ ldd ./gemmcomp.o
linux-gate.so.1 => (0xb76f6000)
libblas.so.3 => /usr/lib/libblas.so.3 (0xb765e000)
libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7576000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb73c7000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7381000)
/lib/ld-linux.so.2 (0xb76f7000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7364000)
Question: What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?
It seems that the BLAS implementation provided by the default libblas-dev package for Ubuntu 14.04 (and probably other Ubuntu distributions) includes an optimization for cases where certain matrix elements are zero.
For Ubuntu 14.04, the source code for the BLAS (and cblas) implementation/package can be downloaded from here.
After unpacking that archive, we have a cblas/src
directory that contains the cblas
API, and we have another src
directory that contains F77 implementations of various blas routines.
In the case of cblas_sgemm, when the parameter CblasRowMajor
is specified, the cblas/src/cblas_sgemm.c
code will call the underlying fortran routine as follows:
void cblas_sgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
const int K, const float alpha, const float *A,
const int lda, const float *B, const int ldb,
const float beta, float *C, const int ldc)
{
...
} else if (Order == CblasRowMajor)
...
F77_sgemm(F77_TA, F77_TB, &F77_N, &F77_M, &F77_K, &alpha, B, &F77_ldb, A, &F77_lda, &beta, C, &F77_ldc);
Note that for this row major call, the order of the A
and B
matrices are reversed when passed to the F77_sgemm
routine. This is sensible but I won't delve into why here. It's sufficient to note that A
has become B
in the fortran call/code, and B
has become A
.
When we inspect the corresponding fortran routine in src/sgemm.f
, we see the following sequence of code:
*
* Start the operations.
*
IF (NOTB) THEN
IF (NOTA) THEN
*
* Form C := alpha*A*B + beta*C.
*
DO 90 J = 1,N
IF (BETA.EQ.ZERO) THEN
DO 50 I = 1,M
C(I,J) = ZERO
50 CONTINUE
ELSE IF (BETA.NE.ONE) THEN
DO 60 I = 1,M
C(I,J) = BETA*C(I,J)
60 CONTINUE
END IF
DO 80 L = 1,K
IF (B(L,J).NE.ZERO) THEN ***OPTIMIZATION
TEMP = ALPHA*B(L,J)
DO 70 I = 1,M
C(I,J) = C(I,J) + TEMP*A(I,L)
70 CONTINUE
END IF
80 CONTINUE
90 CONTINUE
The above is the section of code that handles the case where No transpose of A
and No transpose of B
are indicated (which is true for this cblas row-major test case). The matrix row/column multiply operation is handled by the loop nest beginning at the line where I have added the note ***OPTIMIZATION
. In particular, if the matrix element B(L,J)
is zero, then the DO-loop closing at line 70 is skipped. But remember B
here corresponds to the A
matrix passed to the cblas_sgemm
routine.
The skipping of this do-loop allows the sgemm function implemented this way to be substantially faster for the cases where there are a large number of zeroes in the A
matrix passed to cblas_sgemm
when row-major is specified.
Experimentally, not all BLAS implementations have this optimization. Testing on the exact same platform but using libopenblas-dev
instead of libblas-dev
provides no such speed-up, i.e. essentially no execution time difference when the A
matrix is mostly zeroes, vs. the case when it is not.
Note that the fortran (77) code here appears to be similar to or identical to older published versions of the sgemm.f
routine such as here. Newer published versions of this fortran routine that I could find do not contain this optimization, such as here.