How to find the sum of an array in CUDA by reduction

Views: 120 | Posted: 2017/3/4 14:10:47 | Tags: cuda, reduction


Question

I'm implementing a function that finds the sum of an array by reduction. My array has 32*32 elements with the values 0 ... 1023. The expected sum is 523776, but my result is 15872, which is wrong. Here is my code:

#include <stdio.h>
#include <cuda.h>

#define w 32
#define h 32
#define N w*h

__global__ void reduce(int *g_idata, int *g_odata);
void fill_array (int *a, int n);

int main( void ) {
    int a[N], b[N]; // copies of a, b, c
    int *dev_a, *dev_b; // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for 512 integers

    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );

    fill_array( a, N );

    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

    dim3 blocksize(16,16);
    dim3 gridsize;

    gridsize.x=(w+blocksize.x-1)/blocksize.x;
    gridsize.y=(h+blocksize.y-1)/blocksize.y;

    reduce<<<gridsize, blocksize>>>(dev_a, dev_b);

    // copy device result back to host copy of c
    cudaMemcpy( b, dev_b, sizeof( int ) , cudaMemcpyDeviceToHost );

    printf("Reduced sum of Array elements = %d \n", b[0]);

    cudaFree( dev_a );
    cudaFree( dev_b );

    return 0;
}

__global__ void reduce(int *g_idata, int *g_odata) {

    __shared__ int sdata[256];

    // each thread loads one element from global to shared mem
    int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = g_idata[i];

    __syncthreads();
    // do reduction in shared mem
    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;;

        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (threadIdx.x == 0)
        atomicAdd(g_odata,sdata[0]);
}

// CPU function to generate a vector of random integers
void fill_array (int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = i;
}

Recommended answer

You are doing an atomicAdd to the first element of your dev_b array, but you never initialize that element to a known value (i.e. 0). You do copy b to dev_b before running the kernel, but since b itself was never initialized, that doesn't help. A local array such as b is not automatically initialized to zero in C or C++, if that is what you were thinking. We can fix this by setting b[0] to zero before copying b to dev_b.
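As an alternative to zeroing b[0] on the host, the accumulator can be cleared directly on the device. The following is a minimal sketch, not part of the original answer: the helper name alloc_zeroed_accumulator is hypothetical, and it assumes a single int is enough to hold the result.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not in the original post): allocate a one-int
// accumulator on the device and clear it with cudaMemset, so that
// atomicAdd starts from a known value. Returns NULL on any CUDA error.
int *alloc_zeroed_accumulator(void)
{
    int *dev_sum = NULL;

    if (cudaMalloc((void **)&dev_sum, sizeof(int)) != cudaSuccess)
        return NULL;

    // cudaMemset writes bytes; setting every byte to 0 yields the int 0
    if (cudaMemset(dev_sum, 0, sizeof(int)) != cudaSuccess) {
        cudaFree(dev_sum);
        return NULL;
    }
    return dev_sum;
}
```

With this approach the host array b no longer needs to be copied to the device before the launch; only the single result has to be copied back afterwards.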
Your reduction kernel is written to handle the 1D case (the only thread index it uses is built from the .x values), but you are launching it with 2D threadblocks and a 2D grid. This mismatch won't work properly. With a 16x16 block, blockDim.x is only 16, so each block reduces just 16 inputs, and all 16 threads that have threadIdx.x == 0 perform the atomicAdd. That is consistent with the value you saw: the four blocks produce partial sums of 120 and 376, and 2*16*(120+376) = 15872. We either need to launch a 1D threadblock and grid, or else rewrite the kernel to work with 2D indices (i.e. .x and .y). I've chosen the former (1D).
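For completeness, the other option (keeping the 2D launch and making the kernel 2D-aware) might look roughly like the sketch below. It is not from the original answer. The names reduce2d, sum2d and BLOCK_THREADS are hypothetical, and the sketch assumes a 16x16 threadblock, i.e. a power-of-two thread count no larger than 256.

```cuda
#include <cuda_runtime.h>

#define BLOCK_THREADS 256  // assumed: 16 x 16 threads per block

// Hypothetical 2D-aware variant of the kernel: every thread flattens its
// 2D coordinates into a single linear index before the reduction starts.
__global__ void reduce2d(const int *g_idata, int *g_odata, int width, int height)
{
    __shared__ int sdata[BLOCK_THREADS];

    // linear thread id within the block, and threads per block
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;

    // global 2D coordinates of the element this thread loads
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // threads outside the array contribute 0 to the sum
    sdata[tid] = (col < width && row < height) ? g_idata[row * width + col] : 0;
    __syncthreads();

    // same interleaved-addressing loop as the 1D kernel, driven by tid
    for (int s = 1; s < nthreads; s *= 2) {
        int index = 2 * s * tid;
        if (index < nthreads)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }

    // one atomicAdd per block
    if (tid == 0)
        atomicAdd(g_odata, sdata[0]);
}

// Hypothetical host wrapper: returns the sum, or -1 if a CUDA call fails.
int sum2d(const int *h_in, int width, int height)
{
    int *d_in = NULL, *d_out = NULL, result = -1;
    size_t bytes = (size_t)width * height * sizeof(int);

    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&d_out, sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));  // accumulator starts at 0

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    reduce2d<<<grid, block>>>(d_in, d_out, width, height);

    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling sum2d(a, 32, 32) on the array from the question would launch the same 2x2 grid of 16x16 blocks as the original code, but each block now reduces all 256 of its elements and issues a single atomicAdd.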
Here is a worked example with those changes applied to your code; it seems to produce the correct result:
$ cat t1218.cu
#include <stdio.h>

#define w 32
#define h 32
#define N (w*h)

__global__ void reduce(int *g_idata, int *g_odata);
void fill_array (int *a, int n);

int main( void ) {
    int a[N], b[N]; // host copies of the input (a) and the output (b)
    int *dev_a, *dev_b; // device copies of a and b
    int size = N * sizeof( int ); // we need space for N (1024) integers

    // allocate device copies of a and b
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );

    fill_array( a, N );
    b[0] = 0;  //initialize the first value of b to zero
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

    dim3 blocksize(256); // create 1D threadblock
    dim3 gridsize(N/blocksize.x);  //create 1D grid

    reduce<<<gridsize, blocksize>>>(dev_a, dev_b);

    // copy the device result (first element only) back to the host copy of b
    cudaMemcpy( b, dev_b, sizeof( int ) , cudaMemcpyDeviceToHost );

    printf("Reduced sum of Array elements = %d \n", b[0]);
    printf("Value should be: %d \n", ((N-1)*(N/2)));
    cudaFree( dev_a );
    cudaFree( dev_b );

    return 0;
}

__global__ void reduce(int *g_idata, int *g_odata) {

    __shared__ int sdata[256];

    // each thread loads one element from global to shared mem
    // note use of 1D thread indices (only) in this kernel
    int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = g_idata[i];

    __syncthreads();
    // do reduction in shared mem
    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;

        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (threadIdx.x == 0)
        atomicAdd(g_odata,sdata[0]);
}

// CPU function to fill the array with the sequence 0, 1, ..., n-1
void fill_array (int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = i;
}
$ nvcc -o t1218 t1218.cu
$ cuda-memcheck ./t1218
========= CUDA-MEMCHECK
Reduced sum of Array elements = 523776
Value should be: 523776
========= ERROR SUMMARY: 0 errors
$

Notes:

The kernel and your code as written depend on N being an exact multiple of the threadblock size (256). That is satisfied in this case (1024 = 4 * 256), but things will break if it is not.
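One common way to lift that restriction is to pass the element count to the kernel, round the grid size up, and let out-of-range threads load 0. The sketch below is an illustration under those assumptions, not part of the original answer. The names reduce_guarded, sum_guarded and THREADS are hypothetical.

```cuda
#include <cuda_runtime.h>

#define THREADS 256  // threads per block; must be a power of two here

// Hypothetical guarded variant of the kernel: threads whose global index
// is past the end of the input load 0, so n need not be a multiple of 256.
__global__ void reduce_guarded(const int *g_idata, int *g_odata, int n)
{
    __shared__ int sdata[THREADS];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? g_idata[i] : 0;
    __syncthreads();

    // reduction in shared memory, unchanged from the kernel above
    for (int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * threadIdx.x;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(g_odata, sdata[0]);
}

// Hypothetical host wrapper: returns the sum, or -1 if a CUDA call fails.
int sum_guarded(const int *h_in, int n)
{
    int *d_in = NULL, *d_out = NULL, result = -1;
    size_t bytes = (size_t)n * sizeof(int);

    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&d_out, sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));  // accumulator starts at 0

    int blocks = (n + THREADS - 1) / THREADS;  // round up to cover all of n
    reduce_guarded<<<blocks, THREADS>>>(d_in, d_out, n);

    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

For n = 1000, for example, this launches 4 blocks, and the last 24 threads of the final block simply add zeros.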
I don't see any evidence of proper CUDA error checking. It wouldn't have turned up anything here, but it's good practice. As a quick test, run your code with cuda-memcheck, as I have done here.
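A common pattern for such error checking is a small macro wrapped around every runtime call, plus an explicit check after each kernel launch. The sketch below is one way to write it. The names CUDA_CHECK and check_last_kernel are hypothetical, and exiting on the first error is an assumption that may not suit every program.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical macro: report the failing call's location and exit.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper to call right after a kernel launch.
void check_last_kernel(void)
{
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
}
```

In the example above this would mean writing CUDA_CHECK(cudaMalloc( (void**)&dev_a, size )); and so on for each runtime call, and calling check_last_kernel() right after the reduce<<<...>>> launch. On newer CUDA toolkits, compute-sanitizer takes the place of cuda-memcheck for the quick test.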
