CUDA C sum 1 dimension of 2D array and return

Problem Description

I am new to GPU programming (and rather rusty in C), so this might be a rather basic question with an obvious bug in my code. What I am trying to do is take a 2-dimensional array and, for each row, find the sum across its columns. So if I have a 2D array that contains:

0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 2 4 6 8 10 12 14 16 18

I want to get back an array that contains the following:

45
45
90

The code I have so far is not returning the correct output, and I'm not sure why. I'm guessing it is because I am not handling the indexing in the kernel properly, but it could also be that I am not using the memory correctly, since I adapted this from an over-simplified 1-dimensional example, and the CUDA Programming Guide (section 3.2.2) makes a rather big jump between 1- and 2-dimensional arrays that is not well described for a beginner.

My incorrect attempt:

#include <stdio.h>
#include <stdlib.h>


// start with a small array to test
#define ROW 3
#define COL 10

__global__ void collapse( int *a, int *c){
    /*
       Sum along the columns for each row of the 2D array.
    */
    int total = 0;
    // Loop to get total, seems wrong for GPUs but I don't know a better way
    for (int i=0; i < COL; i++){
        total = total + a[threadIdx.y + i];
    }
    c[threadIdx.x] = total;

}

int main( void ){
    int array[ROW][COL];      // host copies of a, c
    int c[ROW];
    int *dev_a;      // device copies of a, c (just pointers)
    int *dev_c;

    // get the size of the arrays I will need
    int size_2d = ROW * COL * sizeof(int);
    int size_c = ROW * sizeof(int);

    // Allocate the memory
    cudaMalloc( (void**)&dev_a, size_2d);
    cudaMalloc( (void**)&dev_c, size_c);

    // Populate the 2D array on host with something small and known as a test
    for (int i=0; i < ROW; i++){
        if (i == ROW - 1){
            for (int j=0; j < COL; j++){
                array[i][j] = (j*2);
                printf("%i ", array[i][j]);
            }
        } else {
            for (int j=0; j < COL; j++){
                array[i][j] = j;
                printf("%i ", array[i][j]);
            }
        }
        printf("\n");
    }

    // Copy the memory
    cudaMemcpy( dev_a, array, size_2d, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_c, c, size_c, cudaMemcpyHostToDevice );

    // Run the kernel function
    collapse<<< ROW, COL >>>(dev_a, dev_c);

    // copy the output back to the host
    cudaMemcpy( c, dev_c, size_c, cudaMemcpyDeviceToHost );

    // Print the output
    printf("\n");
    for (int i = 0; i < ROW; i++){
        printf("%i\n", c[i]);
    }

    // Release the memory
    cudaFree( dev_a );
    cudaFree( dev_c );
}

Output:

0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 2 4 6 8 10 12 14 16 18

45
45
45

Solution

You are correct, it's an indexing issue. Your kernel will generate a correct answer if you replace this:

    total = total + a[threadIdx.y + i];

with this:

    total = total + a[blockIdx.x*COL + i];

and this:

    c[threadIdx.x] = total;

with this:

    c[blockIdx.x] = total;
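
For reference, here is a minimal sketch of the question's kernel with just those two replacements applied (written out here for convenience, not verbatim from either post). Launched exactly as in the question, every thread in a block still redundantly computes the same row sum, which point 5 below gets into:

__global__ void collapse( int *a, int *c){
    // One block per row: blockIdx.x selects the row.
    int total = 0;
    for (int i = 0; i < COL; i++){
        total = total + a[blockIdx.x * COL + i];   // row offset is blockIdx.x * COL
    }
    c[blockIdx.x] = total;   // one output element per row/block
}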

However, there's more to say than that.

  1. Any time you're having trouble with CUDA code, you should use proper CUDA error checking. The second issue above was definitely resulting in a memory access error, and you may have gotten a hint of this with error checking. You should also run your code with cuda-memcheck, which will do an extra-tight job of bounds checking, and it would definitely catch the out-of-bounds access your kernel was making. (A minimal sketch of this kind of error checking appears right after this list.)

  2. I think you may be confused by the kernel launch syntax: <<<ROW, COL>>>. You may be thinking that this maps to 2D thread coordinates (I'm just guessing, since you used threadIdx.y in a kernel where it has no meaning). However, the first parameter is the number of blocks to be launched, and the second is the number of threads per block. If you provide scalar quantities for both of these (as you have), you will be launching a 1D grid of 1D threadblocks, and your .y variables won't really be meaningful (for indexing). So one takeaway is that threadIdx.y doesn't do anything useful in this setup (it is always zero).

  3. To fix that, we could make the first change listed at the beginning of this answer. Note that when we launch 3 blocks, each block will have a unique blockIdx.x so we can use that for indexing, and we have to multiply that by the "width" of your array to generate proper indexing.

  4. Since the second parameter is the number of threads per block, your indexing into C also didn't make sense. C only has 3 elements (which is sensible), but each block had 10 threads, and in each block those threads were trying to index into the "first 10" locations in C (each thread in a block has a unique value for threadIdx.x), yet beyond the first 3 locations there is no storage in C.

  5. Now possibly the biggest issue. Each thread in a block is doing exactly the same thing in the loop. Your code does not differentiate behavior of threads. You can write code that gives the correct answer this way, but it's not sensible from a performance standpoint.

  6. To fix this last issue, the canonical answer is to use a parallel reduction. That's an involved topic, and there are many questions about it under the CUDA tag here on SO, so I won't try to cover it, but I'll point out that there is a good tutorial here along with the accompanying CUDA sample code that you can study. If you want to see a parallel reduction on the matrix rows, for example, you could look at this question/answer. It happens to be performing a max-reduction instead of a sum-reduction, but the differences are minor. You can also use an atomic method as suggested in the other answer, but that is generally not considered a "high-performance" approach, because the throughput of atomic operations is more limited than what is achievable with ordinary CUDA memory bandwidth. (A sketch of a per-row sum-reduction for this problem appears at the end of this answer.)
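
Returning to point 1, this is a minimal sketch of what runtime error checking can look like. The CUDA_CHECK macro name is mine, not part of the CUDA API, and it relies on the stdio.h/stdlib.h includes already present in the question's code:

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Wrap every runtime API call, and check kernel launches explicitly:
CUDA_CHECK(cudaMalloc((void**)&dev_a, size_2d));
collapse<<<ROW, COL>>>(dev_a, dev_c);
CUDA_CHECK(cudaGetLastError());        // catches launch/configuration errors
CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised during kernel execution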

You also seem to be generally confused about the CUDA kernel execution model, so continued reading of the programming guide (that you've already linked) is a good starting point.
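
Finally, to illustrate what point 6 is suggesting, here is one possible shape of a per-row parallel reduction for this problem. It is an untested sketch of the general shared-memory tree-reduction technique, not the tutorial's code; the names collapse_reduce and COL_POW2 are mine, and COL_POW2 simply pads the thread count up to a power of two so the halving loop works cleanly:

#define COL_POW2 16   // smallest power of two >= COL

__global__ void collapse_reduce( const int *a, int *c){
    __shared__ int sdata[COL_POW2];
    int tid = threadIdx.x;
    // Each thread loads one element of its block's row, zero-padding past COL.
    sdata[tid] = (tid < COL) ? a[blockIdx.x * COL + tid] : 0;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (int s = COL_POW2 / 2; s > 0; s >>= 1){
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        c[blockIdx.x] = sdata[0];   // thread 0 writes the row sum
}

// Launched with one block per row and COL_POW2 threads per block:
// collapse_reduce<<< ROW, COL_POW2 >>>(dev_a, dev_c);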
