How to find the sum of an array in CUDA by reduction
Question
I'm implementing a function that finds the sum of an array by reduction. My array has 32*32 elements, holding the values 0 ... 1023. The expected sum is 523776, but my result is 15872, which is wrong. Here is my code:
#include <stdio.h>
#include <cuda.h>
#define w 32
#define h 32
#define N w*h
__global__ void reduce(int *g_idata, int *g_odata);
void fill_array (int *a, int n);
int main( void ) {
    int a[N], b[N]; // host copies of a, b
    int *dev_a, *dev_b; // device copies of a, b
    int size = N * sizeof( int ); // we need space for N = 1024 integers
    // allocate device copies of a, b
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    fill_array( a, N );
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    dim3 blocksize(16,16);
    dim3 gridsize;
    gridsize.x=(w+blocksize.x-1)/blocksize.x;
    gridsize.y=(h+blocksize.y-1)/blocksize.y;
    reduce<<<gridsize, blocksize>>>(dev_a, dev_b);
    // copy device result back to host copy of b
    cudaMemcpy( b, dev_b, sizeof( int ) , cudaMemcpyDeviceToHost );
    printf("Reduced sum of Array elements = %d \n", b[0]);
    cudaFree( dev_a );
    cudaFree( dev_b );
    return 0;
}
__global__ void reduce(int *g_idata, int *g_odata) {
    __shared__ int sdata[256];
    // each thread loads one element from global to shared mem
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = g_idata[i];
    __syncthreads();
    // do reduction in shared mem
    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;
        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (threadIdx.x == 0)
        atomicAdd(g_odata,sdata[0]);
}
// CPU function to fill a vector with the integers 0 .. n-1
void fill_array (int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = i;
}
Recommended answer
- You are doing an atomicAdd to the first element of your dev_b array, but you never initialize that element to a known value (i.e. 0). You do copy b to dev_b before running the kernel, but since b itself has not been initialized to any known values, that does not help. A local array such as b is not automatically initialized to zero in C or C++, if that is what you were thinking. We can fix this by setting b[0] to zero before copying b to dev_b.
- Your reduction kernel is written for the 1D case (the only thread index it uses is a 1D index built from the .x values), but you are launching it with 2D threadblocks and a 2D grid. This mismatch won't work properly: we either need to launch a 1D threadblock and grid, or else rewrite the kernel to work with 2D indices (i.e. .x and .y). I've chosen the former (1D).
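As a rough check on where 15872 could come from (this is an inference from the posted launch configuration, not something the answer states): with a 2x2 grid of 16x16 blocks, blockDim.x is 16, so the kernel only ever reads g_idata[0] to g_idata[31]. Each block also contains 16 threads with threadIdx.x == 0 (one per threadIdx.y value), and every one of them performs the atomicAdd. Assuming the duplicated rows happened to race benignly in shared memory and the uninitialized b[0] happened to be 0, the arithmetic is consistent with the reported value:

```latex
\begin{align*}
\text{blocks with } \texttt{blockIdx.x}=0:&\quad \sum_{i=0}^{15} i = 120 \\
\text{blocks with } \texttt{blockIdx.x}=1:&\quad \sum_{i=16}^{31} i = 376 \\
\text{total added to } \texttt{g\_odata}:&\quad 16 \times (2 \cdot 120 + 2 \cdot 376) = 16 \times 992 = 15872
\end{align*}
```

Neither assumption is guaranteed, so a different run or device could just as well print a different wrong number.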
Here is a worked example with those changes applied to your code; it seems to produce the correct result:
$ cat t1218.cu
#include <stdio.h>
#define w 32
#define h 32
#define N (w*h)
__global__ void reduce(int *g_idata, int *g_odata);
void fill_array (int *a, int n);
int main( void ) {
    int a[N], b[N]; // host copies of a, b
    int *dev_a, *dev_b; // device copies of a, b
    int size = N * sizeof( int ); // we need space for N = 1024 integers
    // allocate device copies of a, b
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    fill_array( a, N );
    b[0] = 0; //initialize the first value of b to zero
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    dim3 blocksize(256); // create 1D threadblock
    dim3 gridsize(N/blocksize.x); //create 1D grid
    reduce<<<gridsize, blocksize>>>(dev_a, dev_b);
    // copy device result back to host copy of b
    cudaMemcpy( b, dev_b, sizeof( int ) , cudaMemcpyDeviceToHost );
    printf("Reduced sum of Array elements = %d \n", b[0]);
    printf("Value should be: %d \n", ((N-1)*(N/2)));
    cudaFree( dev_a );
    cudaFree( dev_b );
    return 0;
}
__global__ void reduce(int *g_idata, int *g_odata) {
    __shared__ int sdata[256];
    // each thread loads one element from global to shared mem
    // note use of 1D thread indices (only) in this kernel
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = g_idata[i];
    __syncthreads();
    // do reduction in shared mem
    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;
        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (threadIdx.x == 0)
        atomicAdd(g_odata,sdata[0]);
}
// CPU function to fill a vector with the integers 0 .. n-1
void fill_array (int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = i;
}
$ nvcc -o t1218 t1218.cu
$ cuda-memcheck ./t1218
========= CUDA-MEMCHECK
Reduced sum of Array elements = 523776
Value should be: 523776
========= ERROR SUMMARY: 0 errors
$
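The worked example takes the 1D route and zeroes b[0] on the host. For comparison, here is a minimal sketch of the other route mentioned earlier: keep the 2D launch, flatten the .x and .y indices inside the kernel, and zero the accumulator on the device with cudaMemset. The names reduce2d, sum2d, BLOCK_W and BLOCK_H are illustrative and do not come from the original post, and the sketch assumes the array dimensions are exact multiples of the 16x16 block. Error checking is left out for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define BLOCK_W 16
#define BLOCK_H 16
#define BLOCK_THREADS (BLOCK_W * BLOCK_H)

// Hypothetical 2D variant of the reduction kernel. Each thread derives a
// flat index within its block (tid) and a flat index into the row-major
// input array, so every element is loaded by exactly one thread.
__global__ void reduce2d(const int *g_idata, int *g_odata, int width)
{
    __shared__ int sdata[BLOCK_THREADS];

    unsigned int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int nthreads = blockDim.x * blockDim.y;
    unsigned int col      = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int row      = blockIdx.y * blockDim.y + threadIdx.y;

    sdata[tid] = g_idata[row * width + col];
    __syncthreads();

    // Same interleaved-addressing tree reduction as the 1D kernel,
    // but driven by the flattened thread index.
    for (unsigned int s = 1; s < nthreads; s *= 2) {
        unsigned int index = 2 * s * tid;
        if (index < nthreads)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }

    // Exactly one thread per block adds the block's partial sum.
    if (tid == 0)
        atomicAdd(g_odata, sdata[0]);
}

// Illustrative host helper: sums a width x height array of ints.
// Assumes width and height are multiples of BLOCK_W and BLOCK_H.
int sum2d(const int *host_in, int width, int height)
{
    assert(width % BLOCK_W == 0 && height % BLOCK_H == 0);

    int *dev_in = NULL, *dev_out = NULL;
    size_t bytes = (size_t)width * height * sizeof(int);
    cudaMalloc((void **)&dev_in, bytes);
    cudaMalloc((void **)&dev_out, sizeof(int));
    cudaMemcpy(dev_in, host_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(dev_out, 0, sizeof(int));  // zero the accumulator on the device

    dim3 block(BLOCK_W, BLOCK_H);
    dim3 grid(width / BLOCK_W, height / BLOCK_H);
    reduce2d<<<grid, block>>>(dev_in, dev_out, width);

    int result = 0;
    cudaMemcpy(&result, dev_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return result;
}
```

Calling sum2d(a, 32, 32) on the array from the question (values 0 ... 1023) should return 523776. Only a single int is allocated for the output here, since the kernel never touches more than g_odata[0].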
Notes:
- The kernel and your code as written depend on N being an exact multiple of the threadblock size (256). That is satisfied in this case, but things will break if it is not.
- I don't see any evidence of proper CUDA error checking. It wouldn't have turned up anything here, but it's good practice. As a quick test, run your code with cuda-memcheck, as I have done here.
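Both notes can be addressed in a few lines. The following is a sketch only: CUDA_CHECK, reduce_guarded and sum_any_length are hypothetical names, not part of the CUDA API or of the original answer. The kernel pads out-of-range threads with 0 so that N no longer has to be a multiple of the block size, and every runtime call is checked. It assumes n is at least 1 and that the kernel is launched with exactly BLOCK threads per block.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime
// call returns anything other than cudaSuccess.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

#define BLOCK 256

__global__ void reduce_guarded(const int *g_idata, int *g_odata, int n)
{
    __shared__ int sdata[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the array contribute 0, the identity for
    // addition, so the last (partial) block still reduces correctly.
    sdata[tid] = (i < (unsigned int)n) ? g_idata[i] : 0;
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(g_odata, sdata[0]);
}

// Illustrative host helper: sums n ints for any n >= 1.
int sum_any_length(const int *host_in, int n)
{
    assert(n > 0);

    int *dev_in = NULL, *dev_out = NULL;
    size_t bytes = (size_t)n * sizeof(int);
    CUDA_CHECK(cudaMalloc((void **)&dev_in, bytes));
    CUDA_CHECK(cudaMalloc((void **)&dev_out, sizeof(int)));
    CUDA_CHECK(cudaMemcpy(dev_in, host_in, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemset(dev_out, 0, sizeof(int)));

    int blocks = (n + BLOCK - 1) / BLOCK;   // round up to cover all n elements
    reduce_guarded<<<blocks, BLOCK>>>(dev_in, dev_out, n);
    CUDA_CHECK(cudaGetLastError());         // catches launch configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());    // catches errors raised while running

    int result = 0;
    CUDA_CHECK(cudaMemcpy(&result, dev_out, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev_in));
    CUDA_CHECK(cudaFree(dev_out));
    return result;
}
```

For example, with an array holding 0 ... 999 (1000 elements, not a multiple of 256), sum_any_length(a, 1000) should return 499500. Exiting on the first error is the simplest policy for a small test program; a larger application would more likely propagate the cudaError_t to its caller.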