Impact of matrix sparsity on cblas sgemm in Ubuntu 14.04


Question


I have recently discovered that the performance of a cblas_sgemm call for matrix multiplication dramatically improves if the matrices have a "large" number of zeros in them. It improves to the point that it beats its cublas cousin by around 100 times. This could most probably be attributed to some automatic detection of sparsity and a suitable format conversion by the cblas_sgemm function.

Unfortunately, no such behavior is exhibited by its CUDA counterpart, i.e. cublasSgemm.

So the question is: how can I get the same type of optimization in cublasSgemm for matrices that may have a large number of zeros?

And what technique does cblas_sgemm use to automatically adjust to sparse matrices?

Please do not recommend cuSparse / CUSP etc., because

  1. I am not sure about the sparsity of the input matrices beforehand.
  2. I am working on an iterative algorithm where, for the first few iterations, the matrices may be sparse but they gradually become dense as time goes on.

Thanks in advance

Edited to include code to reproduce the above scenario

#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>
#include <cublas_v2.h>
using namespace std;

int main()
{
const int m = 5000; 

timespec blas_start, blas_end, cublas_start, cublas_end;
long totalnsec; //total nano sec
double totalsec, totaltime;
int i, j;
float *A = new float[m]; // 1 x m
float *B = new float[m*m]; // m x m
float *C = new float[m]; // 1 x m

// input matrix A: every 32nd element is non-zero
for(i = 0; i < m; i++)
{       
    A[i] = 0;
    if( i % 32 == 0)    //adjust for sparsity
        A[i] = i;
}

// input matrix B: identity matrix
// col major = row major (the identity matrix is symmetric)
for(i = 0; i < m; i++)
    for(j = 0; j < m; j++)
    {
        if (i==j)           
            B[j*m + i] = 1;
        else
            B[j*m + i] = 0;     
    }

clock_gettime(CLOCK_REALTIME, &blas_start);
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1, A, m, B, m, 0, C, m);
clock_gettime(CLOCK_REALTIME, &blas_end);
    /*
for(i = 0; i < 12; i++)
    printf("%f ", C[i]);
    */

//cublas section

cudaError_t cudaStat;   
cublasHandle_t handle;
cublasCreate(&handle);
//Declaring Device Variables
float *A_d, *B_d, *C_d;

//Allocating Memory for Device Variables
cudaStat = cudaMalloc(&A_d, sizeof(float)*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for A_d\n");

cudaStat = cudaMalloc(&B_d, sizeof(float)*m*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for B_d\n");

cudaStat = cudaMalloc(&C_d, sizeof(float)*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for C_d\n");

// Moving values of A, B onto Device variables
cublasSetVector(m, sizeof(float), A, 1, A_d, 1);
cublasSetMatrix(m, m, sizeof(float), B, m, B_d, m); 

// Do the actual multiplication
float alpha = 1.0f, beta = 0.0f;
cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &cublas_start);

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 1, m, m, &alpha, A_d, 1, B_d, m, &beta, C_d, 1);

cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &cublas_end);

cublasGetVector(m, sizeof(float), C, 1, C_d, 1);
    /*
for(i = 0; i < 12; i++)
    printf("%f ", C[i]);
    */

// Print times
// blas time
totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
if(totalnsec < 0)
{
    totalnsec += 1e9;
    totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"BLAS Time = "<< totaltime << "\n";

//cublas
totalsec = (double)cublas_end.tv_sec - (double)cublas_start.tv_sec;
totalnsec = cublas_end.tv_nsec - cublas_start.tv_nsec;
if(totalnsec < 0)
{
    totalnsec += 1e9;
    totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"CUBLAS Time = "<< totaltime << "\n";

return 0;
}

Ran it to get the following results

malang@ubuntu:~/uas/stackoverflow$ nvcc -arch=sm_12 blascomp.cu -o blascomp.o -lblas -lcublas
malang@ubuntu:~/uas/stackoverflow$ ./blascomp.o
BLAS Time = 0.000964504
CUBLAS Time = 0.0365322

EDIT

Edited after the answer of @Eric

Use of cublasSgemv has greatly enhanced the performance on the GPU. But I still have the problem that cblas_sgemm is much more efficient for sparse matrices on the CPU. What could be the possible reasons?

EDIT Executed the following commands on the suggestion of @Eric, @osgx and @Robert Crovella

erisp@ubuntu:~/uas/stackoverflow$ ldd ./gemmcomp.o
    linux-gate.so.1 =>  (0xb76f6000)
    libblas.so.3 => /usr/lib/libblas.so.3 (0xb765e000)
    libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7576000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb73c7000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7381000)
/lib/ld-linux.so.2 (0xb76f7000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7364000)

erisp@ubuntu:~/uas/stackoverflow$ ll -d /usr/lib/libblas* /etc/alternatives/libblas.*
lrwxrwxrwx 1 root root   26 مارچ  13  2015 /etc/alternatives/libblas.a -> /usr/lib/libblas/libblas.a
lrwxrwxrwx 1 root root   27 مارچ  13  2015 /etc/alternatives/libblas.so -> /usr/lib/libblas/libblas.so
lrwxrwxrwx 1 root root   29 مارچ  13  2015 /etc/alternatives/libblas.so.3 -> /usr/lib/libblas/libblas.so.3
lrwxrwxrwx 1 root root   29 مارچ  13  2015 /etc/alternatives/libblas.so.3gf -> /usr/lib/libblas/libblas.so.3
drwxr-xr-x 2 root root 4096 مارچ  13  2015 /usr/lib/libblas/
lrwxrwxrwx 1 root root   27 مارچ  13  2015 /usr/lib/libblas.a -> /etc/alternatives/libblas.a
lrwxrwxrwx 1 root root   28 مارچ  13  2015 /usr/lib/libblas.so -> /etc/alternatives/libblas.so
lrwxrwxrwx 1 root root   30 مارچ  13  2015 /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
lrwxrwxrwx 1 root root   32 مارچ  13  2015 /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf

Solution

Your code has a problem - you are using the wrong BLAS API. You use the matrix-matrix multiplication routine gemm() to do a vector-matrix multiplication.

For vec-mat-mul or mat-vec-mul you should use gemv(). Of course gemm() can give the correct result for a matrix with only 1 row, but a 1-row matrix is an unexpected corner case for gemm(); it is exactly the shape gemv() is designed for, so with gemm() you may not get peak performance on the GPU and/or the CPU.

You could change to gemv() and benchmark again.
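
As a rough sketch (assuming the same variables A, B, C, A_d, B_d, C_d, handle, alpha, beta and m as in the code above, and not tested as a drop-in replacement), the two gemm() calls could be swapped for gemv() like this:

// CPU: treat A as a length-m vector and compute C = B^T * A,
// which is the same 1 x m row-vector-times-matrix product as before.
// B is row-major with leading dimension m, as in the original call.
cblas_sgemv(CblasRowMajor, CblasTrans, m, m, 1.0f, B, m, A, 1, 0.0f, C, 1);

// GPU: same operation on the device copies. B_d was uploaded with
// cublasSetMatrix() and is read as a column-major m x m matrix,
// so CUBLAS_OP_T reproduces the product of the original cublasSgemm() call.
cublasSgemv(handle, CUBLAS_OP_T, m, m, &alpha, B_d, m, A_d, 1, &beta, C_d, 1);

Since B is the identity in the test program, C should come back equal to A either way, which makes a quick correctness check easy.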


EDIT

Here is my benchmark result with single-threaded MKL. The values of A and B are the same as in your code. I cannot reproduce your result of '0.000964504s' on the CPU. You could check the correctness of your result vector; there is a chance that your cblas library has a bug.

Using gemm()

BLAS Time = 0.0169784
CUBLAS Time = 0.00356155

Using gemv()

BLAS Time = 0.0167557
CUBLAS Time = 0.0013809

EDIT2

I can now reproduce the FAST result on Ubuntu 14.04 with the package libblas-dev.

The reason is answered in the following question.

cblas gemm time dependent on input matrix values - Ubuntu 14.04

In that particular version of BLAS there is code that checks for zero elements. The check costs O(n^2), so it is worth doing in matrix-matrix multiplication, whose cost is O(n^3): every zero the check finds lets the routine skip an O(n) inner loop.
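
As an illustration only (a naive sketch of the kind of zero check found in the reference Fortran sgemm, not the actual library code), the pattern looks roughly like this for column-major n x n matrices:

// C := alpha*A*B + beta*C, all matrices column-major n x n
void naive_sgemm_with_zero_check(int n, float alpha, const float *A,
                                 const float *B, float beta, float *C)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i)             // scale the output column
            C[i + j*n] = (beta == 0.0f) ? 0.0f : beta * C[i + j*n];
        for (int l = 0; l < n; ++l) {
            float blj = B[l + j*n];
            if (blj != 0.0f) {                  // the zero check: O(n^2) tests in total
                float temp = alpha * blj;
                for (int i = 0; i < n; ++i)     // each zero found skips this O(n) loop
                    C[i + j*n] += temp * A[i + l*n];
            }
        }
    }
}

In the question's setup both A and B contain mostly zeros, so almost all of the inner loops are skipped, which is why the CPU timing depends so strongly on the input values.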

For gemm() on the GPU the order of computation is different (block by block instead of line by line), so such an optimization may not be feasible. But it is doable for gemv() on the GPU, where the time spent loading the matrix from global memory could be saved.
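
A deliberately simple (and untuned) CUDA sketch of that idea, computing y = A^T * x with a column-major m x n matrix and one thread per output element; it is only meant to show where the global-memory load can be skipped, not how cuBLAS actually implements gemv():

__global__ void gemv_t_skip_zero(int m, int n, const float *A, int lda,
                                 const float *x, float *y)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // one output element per thread
    if (j >= n) return;
    float acc = 0.0f;
    for (int i = 0; i < m; ++i) {
        float xi = x[i];        // the vector is small and likely cached
        if (xi != 0.0f)         // skip the expensive global-memory load of A(i,j)
            acc += A[i + (size_t)j * lda] * xi;
    }
    y[j] = acc;
}

With the 1-in-32 pattern used for A in the question, roughly 31 out of every 32 matrix loads would be skipped.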
