CUDA Add Rows of a Matrix


Problem Description

I'm trying to add the rows of a 4800x9600 matrix together, resulting in a 1x9600 matrix.

What I've done is split the 4800x9600 into 9,600 matrices of length 4800 each. I then perform a reduction on the 4800 elements.

The trouble is, this is really slow...

Anyone got any suggestions?

Basically, I'm trying to implement MATLAB's sum(...) function.

Here is the code, which I've verified works fine; it's just really slow:

void reduceRows(Matrix Dresult, Matrix DA)
{
        // Split DA into chunks: one scratch buffer the height of a single column
        Matrix Dchunk;
        Dchunk.h = 1;
        Dchunk.w = DA.h;
        cudaMalloc((void**)&Dchunk.data, Dchunk.h * Dchunk.w * sizeof(float));

        Matrix DcolSum;
        DcolSum.h = 1;
        DcolSum.w = 1;
        //cudaMalloc((void**)&DcolSum.data, DcolSum.h * DcolSum.w * sizeof(float));

        int i;
        for (i = 0; i < DA.w; i++)   // loop over each column
        {
                // Copy one column into the scratch buffer, then reduce it
                cudaMemcpy(Dchunk.data, &DA.data[i * DA.h], DA.h * sizeof(float), cudaMemcpyDeviceToDevice);
                DcolSum.data = &Dresult.data[i];
                reduceTotal(DcolSum, Dchunk);
        }
        cudaFree(Dchunk.data);
}

Matrix is defined as:

typedef struct{
        long w;
        long h;
        float* data;
}Matrix;

reduceTotal() just calls the standard NVIDIA reduction: it sums all the elements in Dchunk and puts the answer in DcolSum.

I'm about to do all this on the CPU if I can't find an answer... ;(

Many thanks in advance,

Solution

Instead of looping over each column, parallelize over the columns. Each of the 9,600 threads sums the 4,800 entries in its column and puts the sum in the appropriate place in the result vector.
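As a rough sketch of that idea (the kernel name and block size here are arbitrary choices; it assumes the same column-major layout implied by `&DA.data[i*DA.h]` in the question):

```cuda
// One thread per column: thread `col` sums the h entries of column `col`.
__global__ void sumColumns(float* result, const float* data, long h, long w)
{
    long col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= w) return;

    float sum = 0.0f;
    for (long row = 0; row < h; ++row)
        sum += data[col * h + row];   // column-major indexing
    result[col] = sum;
}

// Host-side launch (256 threads per block is an arbitrary choice):
// int threads = 256;
// int blocks  = (DA.w + threads - 1) / threads;
// sumColumns<<<blocks, threads>>>(Dresult.data, DA.data, DA.h, DA.w);
```

One caveat: with this layout, adjacent threads read addresses DA.h floats apart at each loop iteration, so global memory accesses are not coalesced; transposing the data (or having each block cooperatively reduce one column) would improve bandwidth.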

If you're looking for a library to make working with CUDA simpler, I highly recommend Thrust: http://code.google.com/p/thrust/

Using Thrust, I would create a functor to hold your matrix's pointer in device memory, and then map it over a sequence of column indices. The operator() of the functor would take an index, sum up everything in that column of the matrix, and return the sum. Then you would have your sum sitting in a thrust::device_vector without any memory copies (or even direct CUDA calls).

Your functor might look something like:

struct ColumnSumFunctor {
    const Matrix matrix;

    // Make a functor to sum one column of the matrix
    ColumnSumFunctor(const Matrix& m) : matrix(m) {}

    // Compute and return the sum of the specified column
    __device__
    float operator()(const int& column) const {
        float sum = 0.0f;
        // Column-major layout, matching &DA.data[i*DA.h] in the question
        for (long row = 0; row < matrix.h; ++row)
            sum += matrix.data[column * matrix.h + row];
        return sum;
    }
};
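A sketch of how the functor could then be mapped over the column indices (thrust::transform and counting_iterator are standard Thrust; the exact wiring here is illustrative, assuming DA.data is already a device pointer):

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

// Map the functor over column indices 0..DA.w-1;
// the per-column sums land directly in a device_vector.
thrust::device_vector<float> sums(DA.w);
thrust::transform(thrust::counting_iterator<int>(0),
                  thrust::counting_iterator<int>((int)DA.w),
                  sums.begin(),
                  ColumnSumFunctor(DA));
```

Because the functor is invoked on the device, this avoids the per-column cudaMemcpy and per-column kernel launch in the original loop entirely.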
