Reduce matrix rows with CUDA


Question

Windows 7, NVidia GeForce 425M.

I wrote a simple CUDA code which calculates the row sums of a matrix. The matrix has a uni-dimensional representation (a pointer to float).

The serial version of the code is below (it has 2 loops, as expected):

void serial_rowSum (float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0 ; i < nrow ; i++) {
        sum = 0;
        for (int j = 0 ; j < ncol ; j++)
            sum += m[i*ncol+j];
        output[i] = sum;
    }
}

Inside the CUDA code, I call the kernel function that sweeps the matrix by rows. Below is the kernel call snippet:

dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock)); 

kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);

and the kernel function which performs the parallel sum of the rows (still has 1 loop):

__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {

    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;

    if (rowIdx < nrow) {
        float sum=0;
        for (int k = 0 ; k < ncol ; k++)
            sum+=m[rowIdx*ncol+k];
        s[rowIdx] = sum;            
    }

}

So far so good. The serial and parallel (CUDA) results are equal.

The whole point is that the CUDA version takes almost twice the time of the serial one to compute, even if I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (maximum number of threads per block allowed for my card).

IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.

Below, I report the elapsed times for the serial and parallel versions using different values of nThreadsPerBlock. Times are reported in msec, averaged over 100 samples:

Matrix: nrow = 90000 x ncol = 1000

Serial: Average Time Elapsed per Sample in msec (100 samples): 289.18.

CUDA (32 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 497.11.

CUDA (1024 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 699.66.

Just in case: the version with 32 nThreadsPerBlock is the fastest, and the one with 1024 is the slowest.

I understand that there is some overhead in copying from host to device and back, but maybe the slowness is because I am not implementing the fastest code.

Since I am far from being a CUDA expert:

Am I coding the fastest version of this task? How can I improve my code?
Can I get rid of the loop in the kernel function?

Any thoughts appreciated.

Although I describe a standard rowSum, I am interested in the AND/OR operation over rows which have {0;1} values, like rowAND/rowOR. That said, it doesn't allow me to exploit the cuBLAS trick of multiplying by a column vector of 1s, as suggested by some commenters.
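
For example (just a sketch of what I mean, assuming the {0;1} values are stored as float like everywhere else in my code), a rowAND would be the same kernel with the accumulation swapped, since with {0;1} values the product over a row equals its logical AND:

__global__ void kernel_rowAND(float *m, float *s, int nrow, int ncol) {

    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;

    if (rowIdx < nrow) {
        float acc = 1.0f;                      // identity of AND
        for (int k = 0 ; k < ncol ; k++)
            acc = acc * m[rowIdx*ncol+k];      // with {0;1} values, product == logical AND
        s[rowIdx] = acc;
    }

}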

As suggested by other users, and here endorsed:

FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS, use the Thrust library instead and the magic comes.

Answer

Since you mentioned you need a general reduction algorithm, not just sum, I will try to give 3 approaches here. The kernel approach may have the highest performance. The Thrust approach is the easiest to implement. The cuBLAS approach works only with sum but has good performance.

Here's a very good doc introducing how to optimize a standard parallel reduction. A standard reduction can be divided into 2 stages:


  1. Multiple thread blocks each reduce one part of the data;

  2. One thread block reduces the results of stage 1 into the final 1 element.

For your multi-reduction (reducing the rows of a matrix) problem, only stage 1 is enough. The idea is to reduce 1 row per thread block. For further considerations, like multiple rows per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve the performance further, especially for badly shaped matrices.
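
A minimal sketch of that one-row-per-block idea (my own illustrative kernel, not the paper's code: it assumes the block size is a power of two, uses '+' as the reduction op, and loops over rows with a grid stride so the grid does not have to be as large as nrow; swap the '+' for your AND/OR op as needed):

// Each thread block reduces rows, one row at a time (stage 1 only).
template <unsigned int BLOCK_SIZE>
__global__ void kernel_rowReduce(const float *m, float *s, int nrow, int ncol) {
    __shared__ float sdata[BLOCK_SIZE];

    // Grid-stride loop over rows, so the grid size does not have to equal nrow.
    for (int row = blockIdx.x; row < nrow; row += gridDim.x) {

        // Each thread accumulates a strided portion of the row (0 is the identity of '+').
        float acc = 0.0f;
        for (int k = threadIdx.x; k < ncol; k += BLOCK_SIZE)
            acc += m[row * ncol + k];
        sdata[threadIdx.x] = acc;
        __syncthreads();

        // Tree reduction in shared memory (BLOCK_SIZE assumed to be a power of two).
        for (unsigned int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                sdata[threadIdx.x] += sdata[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            s[row] = sdata[0];
        __syncthreads();   // make sure sdata is consumed before the next row reuses it
    }
}

// Possible launch (blockDim.x must match BLOCK_SIZE):
// kernel_rowReduce<256><<<1024, 256>>>(d_m, d_output, nrow, ncol);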

General multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust (http://stackoverflow.com/questions/17698969/cuda-thrust-find-index-of-minimum-value-among-multiple-vectors/17718468#17718468).
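
A rough sketch of how that can look for the row-sum case (the linear_index_to_row_index helper is just an illustrative functor I am introducing here, mapping an element's linear index to its row index; for AND/OR you would use the reduce_by_key overload that also takes a binary reduction op):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// Maps a linear element index to its row index (row-major storage).
struct linear_index_to_row_index {
    int ncol;
    __host__ __device__ linear_index_to_row_index(int ncol) : ncol(ncol) {}
    __host__ __device__ int operator()(int i) const { return i / ncol; }
};

// Sketch: row sums via thrust::reduce_by_key. d_m is the row-major matrix on the
// device, d_out must have nrow elements. The default reduction op is plus.
void thrust_rowSum(const thrust::device_vector<float>& d_m,
                   thrust::device_vector<float>& d_out,
                   int nrow, int ncol) {
    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                        linear_index_to_row_index(ncol)),
        thrust::make_transform_iterator(thrust::counting_iterator<int>(nrow * ncol),
                                        linear_index_to_row_index(ncol)),
        d_m.begin(),                          // values: matrix elements
        thrust::make_discard_iterator(),      // reduced keys are not needed
        d_out.begin());                       // one reduced value per row
}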

However, thrust::reduce_by_key does not assume that each row has the same length, so you will pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance? (http://stackoverflow.com/questions/14211093/how-to-normalize-matrix-columns-in-cuda-with-max-performance), gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach for the sum of rows. It may give you a basic understanding of the performance.

The sum of the rows/cols of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. It can be represented by the following MATLAB code:

y = A * ones(size(A,2),1);

where y is the sum of the rows of A.

The cuBLAS library provides a high-performance matrix-vector multiplication routine, cublas<t>gemv(), for this operation.
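
A sketch of that call for a row-major nrow x ncol matrix (it assumes a valid cublasHandle_t and a device vector d_ones of ncol ones have already been set up; since cuBLAS expects column-major storage, the row-major matrix is treated as its transpose and the op is CUBLAS_OP_T):

#include <cublas_v2.h>

// Row sums as y = A * ones, using cublasSgemv.
void cublas_rowSum(cublasHandle_t handle, const float* d_A, const float* d_ones,
                   float* d_y, int nrow, int ncol) {
    const float alpha = 1.0f, beta = 0.0f;
    // d_A is stored row-major as nrow x ncol; viewed column-major it is the
    // ncol x nrow transpose, so pass m = ncol, n = nrow, lda = ncol and request
    // op(A) = A^T, which yields y = (row-major A) * ones = the row sums.
    cublasSgemv(handle, CUBLAS_OP_T,
                ncol, nrow,
                &alpha, d_A, ncol,
                d_ones, 1,
                &beta, d_y, 1);
}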

The timing result shows that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance of this operation.
