Reduce matrix rows with CUDA


Question

Windows 7, NVidia GeForce 425M.

I wrote a simple CUDA code which calculates the row sums of a matrix. The matrix is stored in a one-dimensional representation (a pointer to float).

The serial version of the code is below (it has 2 loops, as expected):

void serial_rowSum (float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0 ; i < nrow ; i++) {
        sum = 0;
        for (int j = 0 ; j < ncol ; j++)
            sum += m[i*ncol+j];
        output[i] = sum;
    }
}

Inside the CUDA code, I call the kernel function sweeping the matrix by rows. Below is the kernel call snippet:

dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock)); 

kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);

and the kernel function which performs the parallel sum of the rows (it still has 1 loop):

__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {

    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;

    if (rowIdx < nrow) {
        float sum=0;
        for (int k = 0 ; k < ncol ; k++)
            sum+=m[rowIdx*ncol+k];
        s[rowIdx] = sum;            
    }

}

So far so good. The serial and parallel (CUDA) results are equal.

The whole point is that the CUDA version takes almost twice the time of the serial one to compute, no matter how I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (the maximum number of threads per block allowed on my card).

IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.

Below, I report the time elapsed for the serial and parallel versions using different values of nThreadsPerBlock. Times are in msec, averaged over 100 samples:

Matrix: nrow = 90000 x ncol = 1000

Serial: Average Time Elapsed per Sample in msec (100 samples): 289.18.

CUDA (32 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 497.11.

CUDA (1024 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 699.66.

In case it matters, the version with 32 nThreadsPerBlock is the fastest one and the version with 1024 is the slowest.

I understand that there is some overhead in copying from host to device and back, but maybe the slowness is because I am not implementing the fastest code.

Since I am far from being a CUDA expert:

Did I write the fastest version for this task? How can I improve my code? Can I get rid of the loop in the kernel function?

Any thoughts are appreciated.

Although I describe a standard rowSum, I am interested in the AND/OR operation of rows which have {0,1} values, like rowAND/rowOR. That said, it doesn't allow me to exploit the cuBLAS multiply-by-a-column-vector-of-1s trick, as suggested by some commentators.
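The one-thread-per-row pattern of the kernel above extends directly to such row-wise AND/OR reductions. The following is a hypothetical sketch (the `Op` template parameter, the `identity` argument, and the choice of `fmaxf`/`fminf` as OR/AND on {0,1} data are my assumptions, not from the original post):

```cuda
#include <cuda_runtime.h>

// Hypothetical generalization of kernel_rowSum: the binary operator is a
// template parameter, so rowSum, rowAND and rowOR share one structure.
// For values restricted to {0,1}, OR behaves like fmaxf (identity 0)
// and AND behaves like fminf (identity 1).
template <typename Op>
__global__ void kernel_rowReduce(const float *m, float *s, int nrow, int ncol,
                                 Op op, float identity) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float acc = identity;
        for (int k = 0; k < ncol; k++)
            acc = op(acc, m[rowIdx * ncol + k]);
        s[rowIdx] = acc;
    }
}

struct OrOp  { __device__ float operator()(float a, float b) const { return fmaxf(a, b); } };
struct AndOp { __device__ float operator()(float a, float b) const { return fminf(a, b); } };

// Launch example, same configuration as in the question:
// kernel_rowReduce<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol, OrOp(), 0.0f);
```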

As suggested by other users, and endorsed here:

FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS, use the Thrust library instead and the magic comes.

Answer

Since you mentioned you need a general reduction algorithm, not just sum, I will try to give 3 approaches here. The kernel approach may have the highest performance. The Thrust approach is the easiest to implement. The cuBLAS approach works only with sum but has good performance.

Here's a very good doc introducing how to optimize standard parallel reduction. A standard reduction can be divided into 2 stages:

  1. Multiple thread blocks each reduce one part of the data;
  2. One thread block reduces the results of stage 1 down to the final 1 element.

For your multi-reduction (reducing the rows of a matrix) problem, stage 1 alone is enough. The idea is to reduce 1 row per thread block. For further considerations like multiple rows per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve the performance further, especially for matrices with a bad shape.
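A minimal sketch of the one-row-per-thread-block idea follows; the block size of 256, the grid-stride accumulation, and the power-of-two assumption on blockDim.x are illustrative choices of mine, not taken from the original answer:

```cuda
#include <cuda_runtime.h>

// One thread block reduces one row. Each thread first accumulates a strided
// slice of the row into a register, the partial sums go to shared memory,
// and the block then performs a tree reduction. Assumes blockDim.x is a
// power of two (e.g. 256).
__global__ void kernel_rowSum_block(const float *m, float *s, int ncol) {
    extern __shared__ float sdata[];
    int row = blockIdx.x;           // one block per row
    int tid = threadIdx.x;

    // Strided accumulation over the row: handles ncol > blockDim.x.
    float acc = 0.0f;
    for (int k = tid; k < ncol; k += blockDim.x)
        acc += m[row * ncol + k];
    sdata[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) s[row] = sdata[0];
}

// Launch: one block per row, shared memory sized to the block:
// kernel_rowSum_block<<<nrow, 256, 256 * sizeof(float)>>>(d_m, d_output, ncol);
```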

A general multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust.

However, thrust::reduce_by_key does not assume that every row has the same length, so you will pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach on the sum of rows. It may give you a basic understanding of the performance.
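A sketch of the reduce_by_key route, assuming the matrix is already on the device in row-major order (the helper name and the placeholder-based key iterator are my choices):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// Row sums via thrust::reduce_by_key: element i of the flattened row-major
// matrix belongs to row i / ncol, so transforming a counting iterator
// yields the row keys. Swapping thrust::plus for another associative
// operator (e.g. maximum/minimum for rowOR/rowAND on {0,1} data) reuses
// the same call.
void thrust_rowSum(const thrust::device_vector<float>& d_m,
                   thrust::device_vector<float>& d_output,
                   int nrow, int ncol) {
    using namespace thrust::placeholders;
    auto row_keys = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0), _1 / ncol);
    thrust::reduce_by_key(row_keys, row_keys + nrow * ncol,
                          d_m.begin(),
                          thrust::make_discard_iterator(),  // keys output not needed
                          d_output.begin(),
                          thrust::equal_to<int>(),
                          thrust::plus<float>());
}
```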

The sum of the rows/cols of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. It can be represented by the following MATLAB code:

y = A * ones(size(A,2),1);

where y is the sum of rows of A.

The cuBLAS library provides a high-performance matrix-vector multiplication function, cublas<t>gemv(), for this operation.

Timing results show that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance of this operation.
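A sketch of the cuBLAS route, assuming the matrix is stored row-major as in the question (the wrapper name is mine, and error checking is omitted); since cuBLAS is column-major, the row-major nrow x ncol buffer is viewed as a column-major ncol x nrow matrix and transposed inside the gemv call:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Row sums as y = A * ones. cuBLAS is column-major, so the row-major
// nrow x ncol buffer d_m is treated as a column-major ncol x nrow matrix
// (which is A^T), and CUBLAS_OP_T recovers A in the product.
void cublas_rowSum(cublasHandle_t handle, const float *d_m,
                   const float *d_ones /* length ncol, all 1.0f */,
                   float *d_output /* length nrow */,
                   int nrow, int ncol) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_T,
                ncol, nrow,          // dimensions of the column-major view
                &alpha,
                d_m, ncol,           // lda = leading dimension = ncol
                d_ones, 1,
                &beta,
                d_output, 1);
}
```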

