Reduce matrix columns with CUDA


Problem Description


I have a matrix and I would like to use CUDA to compute the column-wise mean (which boils down to simply the column-wise sum) in the fastest possible way, i.e., return a row vector containing the mean of every column of the matrix. A sum reduction implementation for computing the sum of a single column vector looks like this:

template<typename T>
__global__ void kernelSum(const T* __restrict__ input, T* __restrict__ per_block_results, const size_t n) {
    extern __shared__ T sdata[];

    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    // load input into __shared__ memory
    T x = 0.0;
    if (tid < n) {
        x = input[tid];
    }
    sdata[threadIdx.x] = x;
    __syncthreads();

    // contiguous range pattern
    for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if(threadIdx.x < offset) {
            // add a partial sum upstream to our own
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        }
        // wait until all threads in the block have
        // updated their partial sums
        __syncthreads();
    }

    // thread 0 writes the final result
    if(threadIdx.x == 0) {
        per_block_results[blockIdx.x] = sdata[0];
    }
}

This is invoked as:

int n = ... // vector size
const int BLOCK_SIZE = 1024;
int number_of_blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
double* per_block_results = NULL;
cudaMalloc((void**) &per_block_results, sizeof(double)*(number_of_blocks + 1));
// launch one kernel to compute, per-block, a partial sum
kernelSum<double> <<<number_of_blocks, BLOCK_SIZE, BLOCK_SIZE*sizeof(double)>>>(a, per_block_results, n);
// launch a single block to compute the sum of the partial sums
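// note: this tree reduction assumes the block's thread count (here
// number_of_blocks) is a power of two; otherwise some partial sums are dropped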
kernelSum<double> <<<1, number_of_blocks, number_of_blocks*sizeof(double)>>>(per_block_results, per_block_results + number_of_blocks, number_of_blocks);




I could generalize this kernel to matrices with any number of columns, but I'm limited by shared memory. My GPU has compute capability 3.5, so it has 48KB of shared memory and a maximum block size of 1024, i.e. threads per block. Since I am interested in double precision, I can fit at most 48*1024/8 = 6144 doubles in shared memory. Since the reduction is done per block, I can have a maximum of 6144 (doubles in shared memory) / 1024 (block size) = 6 columns for which I can compute the sum reduction simultaneously. Reducing the block size would allow more columns to be computed simultaneously, e.g. 6144 (doubles in shared memory) / 512 (block size) = 12.
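To make that arithmetic concrete, here is a minimal host-side sketch (assuming the 48KB / 1024-thread figures above) that tabulates how many columns fit per block at each block size:

#include <cstdio>

int main(void)
{
    // assumed figures from above: compute capability 3.5
    const size_t shared_bytes = 48 * 1024;                    // shared memory per block
    const size_t max_doubles = shared_bytes / sizeof(double); // 6144

    for (int block_size = 1024; block_size >= 128; block_size /= 2) {
        printf("block size %4d -> %zu columns per block\n",
               block_size, max_doubles / (size_t)block_size);
    }
    return 0;
}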


Would this more complex strategy beat a simple host-side loop over every column of the matrix that invokes the sum reduction kernel once per column? Is there yet another better way to do this?

Recommended Answer

What is stopping you from doing something like this:

template<typename T>
__global__ void kernelSum(const T* __restrict__ input, 
                          T* __restrict__ per_block_results, 
                          const size_t lda, const size_t n)
{
    extern __shared__ T sdata[];

    // Accumulate per thread partial sum
    T x = 0.0;
    const T * p = &input[blockIdx.x * lda];
    for(int i=threadIdx.x; i < n; i += blockDim.x) {
        x += p[i];
    }

    // load partial sum into __shared__ memory
    sdata[threadIdx.x] = x;
    __syncthreads();

    // contiguous range pattern
    for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if(threadIdx.x < offset) {
            // add a partial sum upstream to our own
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        }
        // wait until all threads in the block have
        // updated their partial sums
        __syncthreads();
    }

    // thread 0 writes the final result
    if(threadIdx.x == 0) {
        per_block_results[blockIdx.x] = sdata[0];
    }
}


[standard disclaimer: written in browser, never compiled or tested, use at own risk]


i.e., you only need one entry in sdata for each thread in the block for the shared memory reduction. Each thread sums as many values as required to cover the full column input. Then there is no shared memory restriction: you can sum any size column with the same block size.
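For completeness, a possible invocation of the kernel above, launching one block per column of an m-by-n column-major matrix, might look like this (the device pointers a_ and b_ and the sizes are assumptions for illustration, not part of the original answer):

const int m = 10000, n = 16;              // assumed matrix dimensions
const int block_size = 256;               // threads per block
size_t shm = block_size * sizeof(double); // one sdata slot per thread
// one block per column; lda == m because columns are stored contiguously
kernelSum<double><<<n, block_size, shm>>>(a_, b_, m, m);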


EDIT: Apparently the idea of using per-thread partial sums is new to you, so here is a complete example to study:

#include <iostream>

template<typename T>
__global__ 
void kernelSum(const T* __restrict__ input, 
        const size_t lda, // pitch of input in words of sizeof(T)
        T* __restrict__ per_block_results, 
                const size_t n)
{
    extern __shared__ T sdata[];

    T x = 0.0;
    const T * p = &input[blockIdx.x * lda];
    // Accumulate per thread partial sum
    for(int i=threadIdx.x; i < n; i += blockDim.x) {
        x += p[i];
    }

    // load thread partial sum into shared memory
    sdata[threadIdx.x] = x;
    __syncthreads();

    for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if(threadIdx.x < offset) {
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        }
        __syncthreads();
    }

    // thread 0 writes the final result
    if(threadIdx.x == 0) {
        per_block_results[blockIdx.x] = sdata[0];
    }
}


int main(void)
{
    const int m = 10000, n = 16;
    float * a = new float[m*n];

    for(int i=0; i<(m*n); i++) { a[i] = (float)(i%10); }

    float *a_;
    size_t size_a = m * n * sizeof(float);
    cudaMalloc((void **)&a_, size_a);
    cudaMemcpy(a_, a, size_a, cudaMemcpyHostToDevice);

    float *b_;
    size_t size_b = n * sizeof(float);
    cudaMalloc((void **)&b_, size_b);

    // select number of warps per block according to size of the
    // column and launch one block per column. Probably makes sense
    // to have at least 4:1 column size to block size
    dim3 blocksize(256); 
    dim3 gridsize(n);
    size_t shmsize = sizeof(float) * (size_t)blocksize.x;
    kernelSum<float><<<gridsize, blocksize, shmsize>>>(a_, m, b_, m);

    float * b = new float[n];
    cudaMemcpy(b, b_, size_b, cudaMemcpyDeviceToHost);

    for(int i=0; i<n; i++) {
       std::cout << i << " " << b[i] << std::endl;
    }

    cudaDeviceReset();

    return 0;
} 
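With these inputs every column contains 1000 full cycles of 0+1+...+9, so each line of output should read 45000, which makes correctness easy to eyeball.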


You should experiment with the block size relative to your matrix size for optimal performance, but in general the more work per thread the kernel does, the better the overall performance will be (because the shared memory reduction is quite expensive). You can see one approach to block and grid size heuristics for a similarly memory-bandwidth-bound problem in this answer.
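If you want to pick the block size empirically, a minimal timing sketch with CUDA events (reusing a_, b_, m and n from the example above) might look like this:

// sketch: time the kernel over a range of power-of-two block sizes
float best_ms = 1e30f; int best_block = 0;
for (int bs = 64; bs <= 1024; bs *= 2) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    kernelSum<float><<<n, bs, bs * sizeof(float)>>>(a_, m, b_, m);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < best_ms) { best_ms = ms; best_block = bs; }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
std::cout << "best block size: " << best_block << " (" << best_ms << " ms)" << std::endl;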
