Unable to find simple sum of 1 to 100 numbers in CUDA?


Problem description


I am working on an image processing algorithm using CUDA. In my algorithm I want to find the sum of all pixels of an image using a CUDA kernel, so I wrote a kernel to compute the sum of all pixels of a 16-bit grayscale image, but I got the wrong answer. I then wrote a simple CUDA program to find the sum of the numbers 1 to 100; my code is below. With it I do not get the exact sum of the numbers 1 to 100 on the GPU, but I do get the exact sum on the CPU. So what have I done wrong in that code?

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <conio.h>
#include <iostream>   // needed for std::cout
#include <malloc.h>
#include <limits>
#include <math.h>

using namespace std;

__global__ void computeMeanValue1(double *pixels,double *sum){

        int x = threadIdx.x;

        sum[0] = sum[0] + (pixels[(x)]);
        __syncthreads();
}

int main(int argc, char **argv)
{
    double *data;
    double *dev_data;
    double *dev_total;
    double *total;

    data=new double[100];   // new[] takes an element count, not a byte count
    total=new double[1];

    double cpuSum=0.0;

    for(int i=0;i<100;i++){
        data[i]=i+1;
        cpuSum=cpuSum+data[i];
    }
    cout<<"CPU total = "<<cpuSum<<std::endl;

    cudaMalloc( (void**)&dev_data, 100 * sizeof(double));
    cudaMalloc( (void**)&dev_total, 1 * sizeof(double));

    cudaMemcpy(dev_data, data, 100 * sizeof(double), cudaMemcpyHostToDevice);

    computeMeanValue1<<<1,100>>>(dev_data,dev_total);
    cudaDeviceSynchronize();

    cudaMemcpy(total, dev_total, 1* sizeof(double), cudaMemcpyDeviceToHost);
    cout<<"GPU total = "<<total[0]<<std::endl;

    cudaFree(dev_data);
    cudaFree(dev_total);

    delete[] data;   // memory from new[] must be released with delete[], not free()
    delete[] total;

    getch();
    return 0;
}


Recommended answer

All of your threads are writing to

sum[0] = sum[0] + (pixels[(x)]);


You can't do this and expect to get the correct result. Your kernel needs to take a different approach to avoid writing to the same memory from different threads. The pattern usually employed for doing this is a reduction. Simply put, with a reduction each thread is responsible for summing a block of elements within the array and then storing the result. By employing a series of these reduction operations it is possible to sum the entire contents of the array.

__global__ void block_sum(const float *input,
                          float *per_block_results,
                          const size_t n)
{
  extern __shared__ float sdata[];
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

  // load input into __shared__ memory
  float x = 0;
  if(i < n)
  {
    x = input[i];
  }
  sdata[threadIdx.x] = x;
  __syncthreads();

  // contiguous range pattern
  for(int offset = blockDim.x / 2;
      offset > 0;
      offset >>= 1)
  {
    if(threadIdx.x < offset)
    {
      // add a partial sum upstream to our own
      sdata[threadIdx.x] += sdata[threadIdx.x + offset];
    }

    // wait until all threads in the block have
    // updated their partial sums
    __syncthreads();
  }

  // thread 0 writes the final result
  if(threadIdx.x == 0)
  {
    per_block_results[blockIdx.x] = sdata[0];
  }
}


Each thread writes to a different location in sdata (sdata[threadIdx.x]), so there is no race condition on the stores. Threads are free to access other elements of sdata because they only read from them, so there are no race conditions there either. Note the use of __syncthreads() to ensure that the operations loading data into sdata are complete before the threads start to read it, and the second call to __syncthreads() to ensure that all the summation operations have completed before the final result is copied from sdata[0]. Note also that only thread 0 writes its result to per_block_results[blockIdx.x], so there is no race condition there either.


You can find the complete sample code for the above on Google Code (I did not write it). This slide deck has a reasonable summary of reductions in CUDA; it includes diagrams that really help in understanding how the interleaved memory reads and writes do not conflict with each other.


You can find lots of other material on efficient implementations of reduction on GPUs. Ensuring that your implementation makes the most efficient use of memory is key to getting the best performance out of a memory-bound operation like reduction.
