Reduce matrix rows with CUDA


Question



Windows 7, NVidia GeForce 425M.

I wrote a simple CUDA code which calculates the row sums of a matrix. The matrix has uni-dimensional representation (pointer to a float).

The serial version of code is below (it has 2 loops, as expected):

void serial_rowSum (float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0 ; i < nrow ; i++) {
        sum = 0;
        for (int j = 0 ; j < ncol ; j++)
            sum += m[i*ncol+j];
        output[i] = sum;
    }
}

Inside the CUDA code, I call the kernel function sweeping the matrix by rows. Below, the kernel call snippet:

dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock)); 

kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);

and the kernel function which performs the parallel sum of the rows (still has 1 loop):

__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {

    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;

    if (rowIdx < nrow) {
        float sum=0;
        for (int k = 0 ; k < ncol ; k++)
            sum+=m[rowIdx*ncol+k];
        s[rowIdx] = sum;            
    }

}

So far so good. The serial and parallel (CUDA) results are equal.

The whole point is that the CUDA version takes almost twice the time of the serial one to compute, even if I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (maximum number of threads per block allowed for my card).

IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.

Below, I report the time elapsed for the serial and parallel versions using different nThreadsPerBlock. Time reported in msec over an average of 100 samples:

Matrix: nrow = 90000 x ncol = 1000

Serial: Average Time Elapsed per Sample in msec (100 samples): 289.18.

CUDA (32 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 497.11.

CUDA (1024 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 699.66.

Just in case, the version with 32/1024 nThreadsPerBlock is the fastest/slowest one.

I understand that there is a kind of overhead when copying from Host to Device and the other way around, but maybe the slowness is because I am not implementing the fastest code.

Since I am far from being a CUDA expert:

Am I coding the fastest version for this task? How could I improve my code? Can I get rid of the loop in the kernel function?

Any thoughts appreciated.

EDIT 1

Although I describe a standard rowSum, I am actually interested in AND/OR operations over rows that have {0,1} values, like rowAND/rowOR. That being so, it doesn't allow me to exploit the cuBLAS trick of multiplying by a column vector of ones, as suggested by some commenters.
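
Just to make the requirement concrete, a rowAND would only change the accumulator and the combining operation in the kernel above; a minimal sketch (kernel_rowAND is a hypothetical name, assuming the matrix stores {0,1} values as floats):

// Hypothetical rowAND variant of the kernel above: same one-thread-per-row
// layout, only the accumulator and the combining operation change.
__global__ void kernel_rowAND(const float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;

    if (rowIdx < nrow) {
        bool acc = true;
        for (int k = 0; k < ncol; k++)
            acc = acc && (m[rowIdx * ncol + k] != 0.0f);  // logical AND instead of +=
        s[rowIdx] = acc ? 1.0f : 0.0f;
    }
}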

EDIT 2

As suggested by other users and endorsed here:

FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS, use the Thrust library instead and the magic comes.

Solution

Since you mentioned you need a general reduction algorithm rather than sum only, I will try to give three approaches here. The kernel approach may have the highest performance. The Thrust approach is the easiest to implement. The cuBLAS approach works only with sum but has good performance.

Kernel Approach

Here's a very good doc introducing how to optimize a standard parallel reduction. A standard reduction can be divided into two stages:

  1. Multiple thread blocks each reduce one part of the data;
  2. One thread block reduces the results of stage 1 to the final single element.

For your multi-reduction (reducing the rows of mat) problem, stage 1 alone is enough. The idea is to reduce one row per thread block. For further considerations such as multiple rows per thread block or one row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve the performance further, especially for matrices with a bad shape.
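
As a minimal sketch of this stage-1-only scheme, assuming the reduction is a sum (the kernel name kernel_rowReduce and BLOCK_SIZE are made up for this example, and the += update can be swapped for an AND/OR update):

#define BLOCK_SIZE 256   // threads per block; a power of two for the tree reduction

// One thread block reduces one row: each thread accumulates a strided partial
// sum over the row, then the block combines the partials in shared memory.
__global__ void kernel_rowReduce(const float *m, float *s, int nrow, int ncol) {
    __shared__ float sdata[BLOCK_SIZE];

    int row = blockIdx.x;      // one block per row
    int tid = threadIdx.x;

    float partial = 0.0f;
    for (int k = tid; k < ncol; k += blockDim.x)
        partial += m[row * ncol + k];   // consecutive threads read consecutive elements
    sdata[tid] = partial;
    __syncthreads();

    // Tree reduction of the per-thread partials in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        s[row] = sdata[0];
}

// Launch with one block per row, e.g.:
// kernel_rowReduce<<<nrow, BLOCK_SIZE>>>(d_m, d_output, nrow, ncol);
// Note: on compute capability 2.x cards the grid is limited to 65535 blocks per
// dimension, so for nrow = 90000 the launch must be split or use a 2D grid.

The main benefit over the one-thread-per-row kernel is memory coalescing: within a block, consecutive threads read consecutive elements of the same row, instead of elements that are ncol floats apart.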

Thrust Approach

General multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust.

However, thrust::reduce_by_key does not assume that each row has the same length, so you will pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach on the sum of rows. It should give you a basic understanding of the performance.
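
For reference, a minimal sketch of this pattern for a row-major matrix (the functor and the wrapper function below are names made up for this example; thrust::plus can be replaced by a logical AND/OR functor for rowAND/rowOR):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// Maps a flat element index to its row index (row-major storage).
struct linear_index_to_row_index {
    typedef int argument_type;
    typedef int result_type;
    int ncol;
    __host__ __device__ linear_index_to_row_index(int c) : ncol(c) {}
    __host__ __device__ int operator()(int i) const { return i / ncol; }
};

// d_m holds the nrow x ncol matrix in row-major order; d_output receives one
// reduced value per row.
void thrust_rowSum(const thrust::device_vector<float> &d_m,
                   thrust::device_vector<float> &d_output,
                   int nrow, int ncol) {
    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                        linear_index_to_row_index(ncol)),
        thrust::make_transform_iterator(thrust::counting_iterator<int>(nrow * ncol),
                                        linear_index_to_row_index(ncol)),
        d_m.begin(),                        // values: the matrix elements
        thrust::make_discard_iterator(),    // reduced keys are not needed
        d_output.begin(),                   // one result per row
        thrust::equal_to<int>(),
        thrust::plus<float>());             // swap for an AND/OR functor if needed
}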

cuBLAS Approach

The sum of the rows/columns of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. It can be represented by the following MATLAB code.

y = A * ones(size(A,2),1);

where y is the vector of row sums of A.

The cuBLAS library provides the high-performance matrix-vector multiplication function cublas<t>gemv() for this operation.
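
A minimal sketch of such a call for a row-major nrow x ncol matrix, assuming a device vector of ncol ones has already been filled and error checking is omitted (the helper name is made up for this example):

#include <cublas_v2.h>

// d_A:    nrow x ncol matrix in row-major order on the device
// d_ones: ncol ones on the device
// d_y:    output, nrow row sums
void cublas_rowSum(cublasHandle_t handle, const float *d_A,
                   const float *d_ones, float *d_y, int nrow, int ncol) {
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS is column-major, so the row-major A looks like an ncol x nrow
    // matrix (A^T); requesting the transposed op yields y = A * ones.
    cublasSgemv(handle, CUBLAS_OP_T,
                ncol, nrow,      // dimensions of the column-major view
                &alpha,
                d_A, ncol,       // leading dimension of that view
                d_ones, 1,
                &beta,
                d_y, 1);
}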

Timing results show that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit on the performance of this operation.
