Dot Product in CUDA using atomic operations - getting wrong results

Problem description

I am trying to implement the dot product in CUDA and compare the result with what MATLAB returns. My CUDA code (based on this tutorial) is the following:

#include <stdio.h>

#define N (2048 * 8)
#define THREADS_PER_BLOCK 512
#define num_t float

// The kernel - DOT PRODUCT
__global__ void dot(num_t *a, num_t *b, num_t *c) 
{
  __shared__ num_t temp[THREADS_PER_BLOCK];
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  temp[threadIdx.x] = a[index] * b[index];
  __syncthreads(); //Synchronize!
  *c = 0.00;
  // Does it need to be tid==0 that
  // undertakes this task?
  if (0 == threadIdx.x) {
    num_t sum = 0.00;
    int i;
    for (i=0; i<THREADS_PER_BLOCK; i++)
      sum += temp[i];
    atomicAdd(c, sum);        
    //WRONG: *c += sum; This read-write operation must be atomic!
  }
}


// Initialize the vectors:
void init_vector(num_t *x)
{
  int i;
  for (i=0 ; i<N ; i++){
    x[i] = 0.001 * i;
  }
}

// MAIN
int main(void)
{
  num_t *a, *b, *c;
  num_t *dev_a, *dev_b, *dev_c;
  size_t size = N * sizeof(num_t);

  cudaMalloc((void**)&dev_a, size);
  cudaMalloc((void**)&dev_b, size);
  cudaMalloc((void**)&dev_c, size);

  a = (num_t*)malloc(size);
  b = (num_t*)malloc(size);
  c = (num_t*)malloc(size);

  init_vector(a);
  init_vector(b);

  cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
  cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

  dot<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

  cudaMemcpy(c, dev_c, sizeof(num_t), cudaMemcpyDeviceToHost);

  printf("a = [\n");
  int i;
  for (i=0;i<10;i++){
    printf("%g\n",a[i]);
  }
  printf("...\n");
  for (i=N-10;i<N;i++){
    printf("%g\n",a[i]);
  }
  printf("]\n\n");
  printf("a*b = %g.\n", *c);


  free(a); free(b); free(c);

  cudaFree(dev_a);
  cudaFree(dev_b);
  cudaFree(dev_c);

}

and I compile it with:

/usr/local/cuda-5.0/bin/nvcc -m64  -I/usr/local/cuda-5.0/include -gencode arch=compute_20,code=sm_20 -o multi_dot_product.o -c multi_dot_product.cu
g++ -m64 -o multi_dot_product multi_dot_product.o -L/usr/local/cuda-5.0/lib64 -lcudart

Information about my NVIDIA cards can be found at http://pastebin.com/8yTzXUuK. I tried to verify the result in MATLAB using the following simple code:

N = 2048 * 8;
a = zeros(N,1);
for i=1:N
    a(i) = 0.001*(i-1);
end

dot_product = a'*a;

But as N increases, I'm getting significantly different results (for instance, for N=2048*32 CUDA returns 6.73066e+07 while MATLAB returns 9.3823e+07; for N=2048*64 CUDA gives 3.28033e+08 while MATLAB gives 7.5059e+08). I'm inclined to believe that the discrepancy stems from the use of float in my C code, but if I replace it with double the compiler complains that atomicAdd does not support double parameters. How should I fix this problem?
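(If I do end up needing double precision, one workaround I have seen suggested for devices without a native double-precision atomicAdd is to emulate it with atomicCAS. A sketch is below, with a helper name of my own choosing; I have not verified it in this program:)

__device__ double atomicAddDouble(double *address, double val)
{
  // Reinterpret the double as a 64-bit integer so atomicCAS can operate on it.
  unsigned long long int *address_as_ull = (unsigned long long int *)address;
  unsigned long long int old = *address_as_ull, assumed;
  do {
    assumed = old;
    // Try to swap in (current value + val); retry if another thread got there first.
    old = atomicCAS(address_as_ull, assumed,
                    __double_as_longlong(val + __longlong_as_double(assumed)));
  } while (assumed != old);
  return __longlong_as_double(old);
}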

Update: Also, for high values of N (e.g. 2048*64), I noticed that the result returned by CUDA changes at every run. This does not happen if N is low (e.g. 2048*8).

At the same time I have a more fundamental question: the variable temp is an array of size THREADS_PER_BLOCK and is shared between threads in the same block. Is it also shared between blocks, or does every block operate on a different copy of this variable? Should I think of the method dot as instructions to every block? Can someone elaborate on how exactly the jobs are split and how the variables are shared in this example?

Solution

Comment this line out of your kernel:

//   *c = 0.00;

And add these lines to your host code, before the kernel call (after the cudaMalloc of dev_c):

num_t h_c = 0.0f;
cudaMemcpy(dev_c, &h_c, sizeof(num_t), cudaMemcpyHostToDevice);

And I believe you'll get results that match MATLAB, more or less.

The fact that you have this line in your kernel, unprotected by any synchronization, is messing you up. Every thread of every block, whenever it happens to execute, zeroes out c as you have written it.

By the way, we can do significantly better with this operation in general by using a classical parallel reduction method. A basic (not optimized) illustration is here. If you combine that method with your usage of shared memory and a single atomicAdd at the end (one atomicAdd per block) you'll have a significantly improved implementation. Although it's not a dot product, this example combines those ideas.
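As a minimal (unoptimized) sketch of that combination, reusing the names from your kernel and assuming THREADS_PER_BLOCK is a power of two and N is a multiple of it, the kernel (named dot_reduce here purely for illustration) could look like this:

// Per-block tree reduction in shared memory, then one atomicAdd per block.
__global__ void dot_reduce(const num_t *a, const num_t *b, num_t *c)
{
  __shared__ num_t temp[THREADS_PER_BLOCK];
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  temp[threadIdx.x] = a[index] * b[index];
  __syncthreads();

  // Halve the number of active threads each step; every thread reaches __syncthreads().
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride)
      temp[threadIdx.x] += temp[threadIdx.x + stride];
    __syncthreads();
  }

  // Thread 0 contributes this block's partial sum exactly once.
  if (threadIdx.x == 0)
    atomicAdd(c, temp[0]);
}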

Edit: responding to a question below in the comments:

A kernel function is the set of instructions that all threads in the grid (all threads associated with a kernel launch, by definition) execute. However, it's reasonable to think of execution as being managed per threadblock, since the threads in a threadblock are executing together to a large extent. However, even within a threadblock, execution is not necessarily in perfect lockstep across all threads. Normally, when we think of lockstep execution, we think of a warp, which is a group of 32 threads in a single threadblock. Therefore, since execution amongst the warps within a block can be skewed, this hazard is present even for a single threadblock.

However, if there were only one threadblock, we could have gotten rid of the hazard in your code using appropriate sync and control mechanisms like __syncthreads() and if (threadIdx.x == 0), etc. But these mechanisms are useless for the general case of controlling execution across multiple threadblocks. Multiple threadblocks can execute in any order. The only defined sync mechanism across an entire grid is the kernel launch itself. Therefore, to fix your issue, we had to zero out c prior to the kernel launch.
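As a usage note, instead of copying a zeroed host variable you could also clear the accumulator with cudaMemset before each launch (all-zero bytes represent 0.0f), which amounts to the same thing:

// Clear the device-side accumulator, then launch the kernel.
cudaMemset(dev_c, 0, sizeof(num_t));
dot<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);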
