How to find the sum of an array in CUDA by reduction

Problem description

I'm implementing a function to find the sum of an array by using reduction. My array has 32*32 elements, with values 0 ... 1023. My expected sum is 523776, but my result is 15872, which is wrong. Here is my code:

#include <stdio.h>
#include <cuda.h>

#define w 32
#define h 32
#define N w*h

__global__ void reduce(int *g_idata, int *g_odata);
void fill_array (int *a, int n);

int main( void ) {
    int a[N], b[N]; // host copies of the input (a) and the output (b)
    int *dev_a, *dev_b; // device copies of a and b
    int size = N * sizeof( int ); // we need space for N (1024) integers

    // allocate device copies of a and b
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );

    fill_array( a, N );

    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

    dim3 blocksize(16,16);
    dim3 gridsize;

    gridsize.x=(w+blocksize.x-1)/blocksize.x;
    gridsize.y=(h+blocksize.y-1)/blocksize.y;

    reduce<<<gridsize, blocksize>>>(dev_a, dev_b);

    // copy device result back to host copy of b
    cudaMemcpy( b, dev_b, sizeof( int ) , cudaMemcpyDeviceToHost );

    printf("Reduced sum of Array elements = %d \n", b[0]);

    cudaFree( dev_a );
    cudaFree( dev_b );

    return 0;
}

__global__ void reduce(int *g_idata, int *g_odata) {

    __shared__ int sdata[256];

    // each thread loads one element from global to shared mem
    int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = g_idata[i];

    __syncthreads();
    // do reduction in shared mem
    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;;

        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (threadIdx.x == 0)
        atomicAdd(g_odata,sdata[0]);
}

// CPU function to fill a vector with the integers 0 .. n-1
void fill_array (int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = i;
}


Recommended answer


  1. You are doing an atomicAdd to the first element of your dev_b array, but you are not initializing that element to a known value (i.e. 0). Sure, before you run the kernel, you are copying b to dev_b, but since you haven't initialized b to any known values, that won't help. The array b is not automatically initialized to zero in C or C++, if that is what you were thinking. We can fix this by setting b[0] to zero before copying b to dev_b.

  2. Your reduction kernel is written to handle a 1D case (i.e. the only thread index used is a 1D thread index based on the .x values), but you are launching the kernel with 2D threadblocks and grids. This mismatch won't work properly, so we need either to launch a 1D threadblock and grid, or else to rewrite the kernel to work with 2D indices (i.e. .x and .y). I've chosen the former (1D).

Here is a worked example with those changes applied to your code. It seems to produce the correct result:

$ cat t1218.cu
#include <stdio.h>

#define w 32
#define h 32
#define N (w*h)

__global__ void reduce(int *g_idata, int *g_odata);
void fill_array (int *a, int n);

int main( void ) {
    int a[N], b[N]; // host copies of the input (a) and the output (b)
    int *dev_a, *dev_b; // device copies of a and b
    int size = N * sizeof( int ); // we need space for N (1024) integers

    // allocate device copies of a and b
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );

    fill_array( a, N );
    b[0] = 0;  //initialize the first value of b to zero
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

    dim3 blocksize(256); // create 1D threadblock
    dim3 gridsize(N/blocksize.x);  //create 1D grid

    reduce<<<gridsize, blocksize>>>(dev_a, dev_b);

    // copy device result back to host copy of b
    cudaMemcpy( b, dev_b, sizeof( int ) , cudaMemcpyDeviceToHost );

    printf("Reduced sum of Array elements = %d \n", b[0]);
    printf("Value should be: %d \n", ((N-1)*(N/2)));
    cudaFree( dev_a );
    cudaFree( dev_b );

    return 0;
}

__global__ void reduce(int *g_idata, int *g_odata) {

    __shared__ int sdata[256];

    // each thread loads one element from global to shared mem
    // note use of 1D thread indices (only) in this kernel
    int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = g_idata[i];

    __syncthreads();
    // do reduction in shared mem
    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;

        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (threadIdx.x == 0)
        atomicAdd(g_odata,sdata[0]);
}

// CPU function to fill a vector with the integers 0 .. n-1
void fill_array (int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = i;
}
$ nvcc -o t1218 t1218.cu
$ cuda-memcheck ./t1218
========= CUDA-MEMCHECK
Reduced sum of Array elements = 523776
Value should be: 523776
========= ERROR SUMMARY: 0 errors
$
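
For comparison, the other option from point 2 (keeping the 2D launch and rewriting the kernel to use both .x and .y) could look roughly like the sketch below. This is not part of the tested example above. The names reduce2d, grid_for and sum2d_on_device are hypothetical, and the sketch assumes a 16x16 block (256 threads, a power of two) and a row-major array.

```cuda
#include <cuda_runtime.h>

#define BLOCK_THREADS 256  // assumed: 16 x 16 threads per block, a power of two

// Sketch of a reduction kernel that uses both .x and .y thread indices.
__global__ void reduce2d(const int *g_idata, int *g_odata, int width, int height) {
    __shared__ int sdata[BLOCK_THREADS];

    // flatten the 2D thread index into a single index within the block
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;

    // global 2D coordinates of the element this thread loads
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // out-of-range threads contribute 0, the identity for addition
    sdata[tid] = (x < width && y < height) ? g_idata[y * width + x] : 0;
    __syncthreads();

    // same interleaved reduction as the 1D kernel, over the flattened index
    for (int s = 1; s < nthreads; s *= 2) {
        int index = 2 * s * tid;
        if (index < nthreads)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }

    // one atomicAdd per block
    if (tid == 0)
        atomicAdd(g_odata, sdata[0]);
}

// Hypothetical host helper: grid dimensions that cover a width x height array.
dim3 grid_for(int width, int height, dim3 block) {
    return dim3((width + block.x - 1) / block.x,
                (height + block.y - 1) / block.y);
}

// Hypothetical host wrapper (error checking omitted for brevity).
int sum2d_on_device(const int *h_in, int width, int height) {
    int *d_in = NULL, *d_out = NULL;
    int result = 0;
    size_t bytes = (size_t)width * height * sizeof(int);
    dim3 block(16, 16);

    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));  // the accumulator must start at zero
    reduce2d<<<grid_for(width, height, block), block>>>(d_in, d_out, width, height);
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The key difference from the 1D kernel is that every thread in the block gets a distinct slot in shared memory through the flattened index, so all 256 threads of a 16x16 block take part in the reduction instead of 16 rows of threads colliding on the same 16 slots.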

Notes:

  1. The kernel and your code as written depend on N being an exact multiple of the threadblock size (256). That is satisfied for this case, but things will break if it is not.

  2. I don't see any evidence of proper CUDA error checking. It wouldn't have turned up anything here, but it's good practice. As a quick test, run your code with cuda-memcheck as I have done here.
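
Both notes can be sketched together as follows. This is only an illustration under stated assumptions, not a drop-in replacement for the code above. The macro CUDA_CHECK and the functions reduce_guarded, blocks_for and sum_on_device are hypothetical names, and the block size is assumed to stay a power of two.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 256  // assumed block size; the reduction loop needs a power of two

// Hypothetical helper macro: print a message and exit on any CUDA runtime error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Same interleaved reduction as above, plus a bounds check on the load,
// so n no longer has to be a multiple of the block size.
__global__ void reduce_guarded(const int *g_idata, int *g_odata, int n) {
    __shared__ int sdata[THREADS];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // threads past the end of the input contribute 0
    sdata[threadIdx.x] = (i < n) ? g_idata[i] : 0;
    __syncthreads();

    for (int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * threadIdx.x;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(g_odata, sdata[0]);
}

// Hypothetical host helper: round up so the last partial block is still launched.
int blocks_for(int n, int threads) {
    return (n + threads - 1) / threads;
}

// Hypothetical host wrapper: sums n ints from host memory, checking every call.
int sum_on_device(const int *h_in, int n) {
    int *d_in = NULL, *d_out = NULL;
    int result = 0;

    CUDA_CHECK(cudaMalloc((void **)&d_in, n * sizeof(int)));
    CUDA_CHECK(cudaMalloc((void **)&d_out, sizeof(int)));
    CUDA_CHECK(cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemset(d_out, 0, sizeof(int)));  // accumulator starts at zero

    reduce_guarded<<<blocks_for(n, THREADS), THREADS>>>(d_in, d_out, n);
    CUDA_CHECK(cudaGetLastError());       // launch and configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    return result;
}
```

Calling sum_on_device(a, N) from main would replace the manual allocation, copy and launch steps. Note that on recent CUDA toolkits the cuda-memcheck tool has been superseded by compute-sanitizer, which is run the same way (compute-sanitizer ./t1218).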
