CUDA: 2D array indexing giving unexpected results

Question

I started learning CUDA, and I wanted to write a simple program that copied some data to the GPU, modified it, and transferred it back. I've already googled around and tried to find my mistake. I'm pretty sure that the problem is in my kernel, but I'm not completely sure what is wrong.

Here is my kernel:

__global__ void doStuff(float* data, float* result)
{
    if (threadIdx.x < 9) // take the first 9 threads
    {
        int index = threadIdx.x;
        result[index] = (float) index;
    }
}

Here is my main:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    /*
        Setup
    */
    float simple[] = {-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0, -9.0};

    float* data_array;
    float* result_array;

    size_t data_array_pitch, result_array_pitch;
    int width_in_bytes = 3 * sizeof(float);
    int height = 3;

    /*
        Initialize GPU arrays
    */
    cudaMallocPitch(&data_array, &data_array_pitch, width_in_bytes, height);
    cudaMallocPitch(&result_array, &result_array_pitch, width_in_bytes, height);

    /*
        Copy data to GPU
    */
    cudaMemcpy2D(data_array, data_array_pitch, simple, width_in_bytes, width_in_bytes, height, cudaMemcpyHostToDevice);

    dim3 threads_per_block(16, 16);
    dim3 num_blocks(1,1);

    /*
        Do stuff
    */
    doStuff<<<num_blocks, threads_per_blocks>>>(data_array, result_array);

    /*
        Get the results
    */
    cudaMemcpy2D(simple, width_in_bytes, result_array, result_array_pitch, width_in_bytes, height, cudaMemcpyDeviceToHost);

    for (int i = 1; i <= 9; ++i)
    {
        printf("%f ", simple[i-1]);
        if(!(i%3))
            printf("\n");
    }

    return 0;
}

When I run this I get 0.000000 1.000000 2.00000 for the first row and garbage for the other two.

Answer

I'm not sure I would focus on 2D arrays if you're just starting to learn CUDA.

I'm also curious whether you manually typed your code into the question, because you define a threads_per_block variable but then use threads_per_blocks in the kernel invocation.

Anyway, there are several problems with your code:


  1. When using 2D arrays, it's almost always necessary to pass the pitch parameter (in some fashion) to the kernel. cudaMallocPitch allocates arrays with extra padding on the end of each row, so that the next row starts at a nicely aligned boundary. This usually results in an allocation granularity of 128 or 256 bytes. So your first row has 3 valid data entities followed by enough empty space to fill up, say, 256 bytes (equal to whatever your pitch variable is). So we have to change the kernel invocation and the kernel itself to account for this (a short address-arithmetic sketch follows this list).
  2. Your kernel is inherently a 1D kernel (it does not comprehend or use threadIdx.y, for example). Therefore there's no point in launching a 2D grid. Although it doesn't hurt anything in this case, it creates redundancy which can be confusing and troublesome in other codes.
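
As an aside on point 1, here is a minimal sketch (not part of the original answer) of the usual address arithmetic for a pitched allocation, following the pattern described in the CUDA documentation; the helper name read_pitched is made up for illustration:

__device__ float read_pitched(const float* base, size_t pitch, int r, int c)
{
    // pitch is in bytes, so step down r rows on a char* before casting back to float*
    const float* row = (const float*)((const char*)base + r * pitch);
    return row[c];
}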

Here's updated code showing some changes that will give you the expected results, based on the comments above:

#include <stdio.h>


__global__ void doStuff(float* data, float* result, size_t dpitch, size_t rpitch, int width)
{
    if (threadIdx.x < 9) // take the first 9 threads
    {
        int index = threadIdx.x;
        result[((index/width)*(rpitch/sizeof(float)))+ (index%width)] = (float) index;
    }
}

int main(void)
{
    /*
        Setup
    */
    float simple[] = {-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0, -9.0};

    float* data_array;
    float* result_array;

    size_t data_array_pitch, result_array_pitch;
    int height = 3;
    int width = 3;
    int width_in_bytes = width * sizeof(float);

    /*
        Initialize GPU arrays
    */
    cudaMallocPitch(&data_array, &data_array_pitch, width_in_bytes, height);
    cudaMallocPitch(&result_array, &result_array_pitch, width_in_bytes, height);

    /*
        Copy data to GPU
    */
    cudaMemcpy2D(data_array, data_array_pitch, simple, width_in_bytes, width_in_bytes, height, cudaMemcpyHostToDevice);

    dim3 threads_per_block(16);
    dim3 num_blocks(1,1);

    /*
        Do stuff
    */
    doStuff<<<num_blocks, threads_per_block>>>(data_array, result_array, data_array_pitch, result_array_pitch, width);

    /*
        Get the results
    */
    cudaMemcpy2D(simple, width_in_bytes, result_array, result_array_pitch, width_in_bytes, height, cudaMemcpyDeviceToHost);

    for (int i = 1; i <= 9; ++i)
    {
        printf("%f ", simple[i-1]);
        if(!(i%3))
            printf("\n");
    }
    return 0;
}

You might also find this question interesting reading.

EDIT: responding to a question in the comments:

result[((index/width)*(rpitch/sizeof(float)))+ (index%width)] = (float) index;
              1               2                      3

要计算正确的元素索引到倾斜数组中,我们必须:

To compute the correct element index into the pitched array we must:


  1. Compute the (virtual) row index from the thread index. We do this by taking the integer division of the thread index by the width of each (non-pitched) row (in elements, not bytes).
  2. Multiply the row index by the width of each pitched row. The width of each pitched row is given by the pitch parameter, which is in bytes. To convert this pitch in bytes into a pitch in elements, we divide by the size of each element. Multiplying that quantity by the row index computed in step 1 indexes into the correct row.
  3. Compute the (virtual) column index from the thread index by taking the remainder (modulo) of the thread index divided by the width (in elements). Once we have the column index (in elements), we add it to the start-of-the-correct-row index computed in step 2 to identify the element that this thread will be responsible for. (A small worked example follows this list.)
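
To make the arithmetic concrete, here is a small host-side walk-through (not from the original answer). The pitch value of 512 bytes is an assumption for illustration; the real value is whatever cudaMallocPitch reports on your device:

#include <stdio.h>

int main(void)
{
    int    index  = 7;      // thread index
    int    width  = 3;      // logical row width in elements
    size_t rpitch = 512;    // pitch in bytes (assumed for this example)

    int row       = index / width;                    // step 1: 7 / 3 = 2
    int rowStride = (int)(rpitch / sizeof(float));    // step 2: 512 / 4 = 128 elements per pitched row
    int col       = index % width;                    // step 3: 7 % 3 = 1

    printf("flat index = %d\n", row * rowStride + col);   // prints 257: element (2,1) of the 3x3 array
    return 0;
}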

The above is a fair amount of effort for a relatively straightforward operation, which is one example of why I suggest focusing on basic CUDA concepts rather than pitched arrays first. For example, I would figure out how to handle 1D and 2D thread blocks, and 1D and 2D grids, before tackling pitched arrays. Pitched arrays are a useful performance enhancer for accessing 2D arrays (or 3D arrays) in some instances, but they are by no means necessary for handling multidimensional arrays in CUDA.
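
For comparison, here is a minimal sketch (not from the original answer) of the non-pitched approach to the same exercise: one flat cudaMalloc, a plain cudaMemcpy, and ordinary row-major indexing. The kernel name doStuffFlat is made up, and error checking is omitted for brevity:

#include <stdio.h>

__global__ void doStuffFlat(float* result, int n)
{
    int index = threadIdx.x;
    if (index < n)                  // guard: only the first n threads write
        result[index] = (float)index;
}

int main(void)
{
    const int width = 3, height = 3, n = width * height;
    float simple[n];

    float* d_result;
    cudaMalloc(&d_result, n * sizeof(float));   // flat allocation, no pitch to track

    doStuffFlat<<<1, 16>>>(d_result, n);

    cudaMemcpy(simple, d_result, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_result);

    for (int i = 0; i < n; ++i)
        printf("%f%c", simple[i], (i % width == width - 1) ? '\n' : ' ');
    return 0;
}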
