cblas gemm time dependent on input matrix values - Ubuntu 14.04


Problem description

This is an extension of my earlier question, but I am asking it separately because I am getting really frustrated, so please do not down-vote it!

Question: What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?

I know gemv is designed for matrix-vector multiplication, but why can't I use gemm for vector-matrix multiplication if it takes less time, especially for sparse matrices?

A short representative code is given below. It asks the user to enter a value and then populates a vector with that value. It then replaces every 32nd value with its index. So, if we enter '0' we get a sparse vector, but for other values we get a dense vector.

#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>

using namespace std;

int main()
{
const int m = 5000; 

timespec blas_start, blas_end;
long totalnsec; //total nano sec
double totalsec, totaltime;
int i, j;
float *A = new float[m]; // 1 x m
float *B = new float[m*m]; // m x m
float *C = new float[m]; // 1 x m

float input;
cout << "Enter a value to populate the vector (0 for sparse) ";
cin >> input; // enter 0 for sparse

// input matrix A: every 32nd element is non-zero, rest of the values = input
for(i = 0; i < m; i++)
{       
A[i] = input;
if( i % 32 == 0)    //adjust for sparsity
        A[i] = i;
}

// input matrix B: identity matrix
for(i = 0; i < m; i++)
        for(j = 0; j < m; j++)
            B[i*m + j] = (i==j);

clock_gettime(CLOCK_REALTIME, &blas_start);
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
clock_gettime(CLOCK_REALTIME, &blas_end);

/* for(i = 0; i < m; i++)
        printf("%f ", C[i]);
printf("\n\n");    */

// Print time
totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
if(totalnsec < 0)
{
    totalnsec += 1e9;
    totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"Duration = "<< totaltime << "\n";

delete[] A;
delete[] B;
delete[] C;

return 0;
}

I run it as follows on Ubuntu 14.04 with BLAS 3.0:

erisp@ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0291558
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.000959521

EDIT

If I replace the gemm call with the following gemv call then matrix sparsity does not matter

//cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
cblas_sgemv(CblasRowMajor, CblasNoTrans, m, m, 1.0f, B, m, A, 1, 0.0f, C, 1);

Results

erisp@ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0301581
erisp@ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.0299282

But the issue is that I am trying to optimize someone else's code that uses cublas, and he successfully and efficiently uses gemm to perform this vector-matrix multiplication. So, I have to either test against it or categorically prove this call to be incorrect.

EDIT

I have even updated my BLAS library today using:

sudo apt-get install libblas-dev liblapack-dev

EDIT: Executed the following commands as suggested by different contributors

erisp@ubuntu:~/uas/stackoverflow$ ll -d /usr/lib/libblas* /etc/alternatives/libblas.*
lrwxrwxrwx 1 root root   26 مارچ  13  2015 /etc/alternatives/libblas.a -> /usr/lib/libblas/libblas.a
lrwxrwxrwx 1 root root   27 مارچ  13  2015 /etc/alternatives/libblas.so -> /usr/lib/libblas/libblas.so
lrwxrwxrwx 1 root root   29 مارچ  13  2015 /etc/alternatives/libblas.so.3 -> /usr/lib/libblas/libblas.so.3
lrwxrwxrwx 1 root root   29 مارچ  13  2015 /etc/alternatives/libblas.so.3gf -> /usr/lib/libblas/libblas.so.3
drwxr-xr-x 2 root root 4096 مارچ  13  2015 /usr/lib/libblas/
lrwxrwxrwx 1 root root   27 مارچ  13  2015 /usr/lib/libblas.a -> /etc/alternatives/libblas.a
lrwxrwxrwx 1 root root   28 مارچ  13  2015 /usr/lib/libblas.so -> /etc/alternatives/libblas.so
lrwxrwxrwx 1 root root   30 مارچ  13  2015 /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
lrwxrwxrwx 1 root root   32 مارچ  13  2015 /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf

erisp@ubuntu:~/uas/stackoverflow$ ldd ./gemmcomp.o
    linux-gate.so.1 =>  (0xb76f6000)
    libblas.so.3 => /usr/lib/libblas.so.3 (0xb765e000)
    libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7576000)
     libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb73c7000)
     libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7381000)
/lib/ld-linux.so.2 (0xb76f7000)
     libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7364000)

Solution

Question: What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?

It seems that the BLAS implementation provided by the default libblas-dev package for Ubuntu 14.04 (and probably other Ubuntu distributions) includes an optimization for cases where certain matrix elements are zero.

For Ubuntu 14.04, the source code for the BLAS (and cblas) implementation/package can be downloaded from here.

After unpacking that archive, we have a cblas/src directory that contains the cblas API, and we have another src directory that contains F77 implementations of various blas routines.

In the case of cblas_sgemm, when the parameter CblasRowMajor is specified, the cblas/src/cblas_sgemm.c code will call the underlying fortran routine as follows:

void cblas_sgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
             const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
             const int K, const float alpha, const float  *A,
             const int lda, const float  *B, const int ldb,
             const float beta, float  *C, const int ldc)
{
...
} else if (Order == CblasRowMajor)
...
  F77_sgemm(F77_TA, F77_TB, &F77_N, &F77_M, &F77_K, &alpha, B, &F77_ldb, A, &F77_lda, &beta, C, &F77_ldc);

Note that for this row-major call, the order of the A and B matrices is reversed when passed to the F77_sgemm routine. This is sensible, but I won't delve into why here. It's sufficient to note that A has become B in the fortran call/code, and B has become A.
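For completeness (my gloss, not part of the original answer): a row-major m x n array, read as column-major data, holds the n x m transpose of the same matrix, so the swap follows directly from the transpose identity:

```latex
% Row-major C = A * B, reinterpreted in column-major storage: each
% array now represents the transpose of its matrix, and
C^{T} = (A B)^{T} = B^{T} A^{T}
% so passing B first (with M and N also swapped) lets the
% column-major fortran routine produce exactly the row-major result.
```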

When we inspect the corresponding fortran routine in src/sgemm.f, we see the following sequence of code:

*
*     Start the operations.
*
      IF (NOTB) THEN
          IF (NOTA) THEN
*
*           Form  C := alpha*A*B + beta*C.
*
              DO 90 J = 1,N
                  IF (BETA.EQ.ZERO) THEN
                      DO 50 I = 1,M
                          C(I,J) = ZERO
   50                 CONTINUE
                  ELSE IF (BETA.NE.ONE) THEN
                      DO 60 I = 1,M
                          C(I,J) = BETA*C(I,J)
   60                 CONTINUE
                  END IF
                  DO 80 L = 1,K
                      IF (B(L,J).NE.ZERO) THEN        ***OPTIMIZATION
                          TEMP = ALPHA*B(L,J)
                          DO 70 I = 1,M
                              C(I,J) = C(I,J) + TEMP*A(I,L)
   70                     CONTINUE
                      END IF
   80             CONTINUE
   90         CONTINUE

The above is the section of code that handles the case where no transpose of A and no transpose of B is indicated (which is true for this cblas row-major test case). The matrix row/column multiply operation is handled by the loops beginning where I have added the note ***OPTIMIZATION. In particular, if the matrix element B(L,J) is zero, then the DO-loop closing at line 70 is skipped. But remember that B here corresponds to the A matrix passed to the cblas_sgemm routine.

The skipping of this do-loop allows the sgemm function implemented this way to be substantially faster for the cases where there are a large number of zeroes in the A matrix passed to cblas_sgemm when row-major is specified.

Experimentally, not all BLAS implementations have this optimization. Testing on the exact same platform but using libopenblas-dev instead of libblas-dev showed no such speed-up, i.e. essentially no execution-time difference whether the A matrix is mostly zeroes or not.

Note that the fortran (77) code here appears to be similar to or identical to older published versions of the sgemm.f routine such as here. Newer published versions of this fortran routine that I could find do not contain this optimization, such as here.
