CUDA programming histogram


Problem description

I want to run a CUDA program, but I am a beginner. I have to write a histogram program that uses buckets: each number is added to the appropriate bucket depending on maxValue (40 in the example). If we have 4 buckets:

histo: | 1 | 10 | 30 | 39 | 32 | 2 | 4 | 5 | 1 |

0-9 (first bucket)

10-19 (second bucket)

20-29 (third bucket)

30-39 (fourth bucket)
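The mapping from a value to its bucket can be sketched as below. This is only an illustration of the ranges listed above: the helper name `bucketIndex` and its parameters are assumptions, not part of the original post, and it assumes non-negative values and equal-width buckets.

```cuda
// Hypothetical helper (not from the original post): maps a non-negative
// value to one of `buckets` equal-width ranges covering [0, maxValue).
__host__ __device__ inline int bucketIndex(int value, int maxValue, int buckets)
{
    // Bucket width, rounded up so that every value below maxValue fits.
    int width = (maxValue + buckets - 1) / buckets;
    int id = value / width;
    // Clamp values at or above maxValue into the last bucket.
    return id < buckets ? id : buckets - 1;
}
```

With maxValue = 40 and 4 buckets the width is 10, so 1 falls into the first bucket, 10 into the second, and 30, 32 and 39 into the fourth.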

My GPU has Compute Capability 1.1.

I tried giving each block a shared temp[] table to which each of its threads adds its values:

__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                               unsigned int *histo )
{
     extern __shared__ unsigned int temp[];
     temp[threadIdx.x] = 0;
     __syncthreads();

     int i = threadIdx.x + blockIdx.x * blockDim.x;
     int offset = blockDim.x * gridDim.x;
     int bucketID;
     while (i < size)
     {
              bucketID = array[i]/Bwidth;
              atomicAdd( &temp[bucketID], 1);
              i += offset;
     }
     __syncthreads();


    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}

histo_kernel_optimized <<<array_size/buckets, buckets,buckets*sizeof(unsigned int)>>>(buffer,SIZE, histogram)

But the compiler says: Instruction '{atom,red}.shared' requires .target sm_12 or higher

I also tried giving each thread its own temp table:

__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                               unsigned int *histo )
{
    unsigned int temp[buckets];
     int j;
    for (j=0;j<buckets;j++){
        temp[j]=0;
    }

    int bucketID;

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        bucketID = array[i]/Bwidth;
        temp[bucketID]++;
        i += offset;
    }


    for (j=0;j<buckets;j++){
        histo[j] += temp[j];    
    }
 }

But the compiler won't let me do this, because it needs a constant to size the temp table, and the problem states that the number of buckets is given dynamically on the command line.
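One possible way around the constant-size requirement is to size the per-thread table with a compile-time upper bound and use only the first `buckets` entries at run time. The sketch below assumes such a bound is acceptable. `MAX_BUCKETS`, the kernel name `histo_private` and the extra `buckets` and `bucketWidth` parameters are all hypothetical, not from the original post. It also merges the results with `atomicAdd` on global memory, since a plain `histo[j] += temp[j]` from many threads would race.

```cuda
#define MAX_BUCKETS 64  // assumed upper bound; the caller must ensure buckets <= MAX_BUCKETS

__global__ void histo_private(const unsigned char *buffer, long size,
                              unsigned int *histo, int buckets, int bucketWidth)
{
    // Fixed-size per-thread table; only the first `buckets` entries are used.
    unsigned int temp[MAX_BUCKETS];
    for (int j = 0; j < buckets; j++)
        temp[j] = 0;

    // Grid-stride loop: each thread handles several input elements.
    long i = threadIdx.x + (long)blockIdx.x * blockDim.x;
    long stride = (long)blockDim.x * gridDim.x;
    for (; i < size; i += stride) {
        int bucketID = buffer[i] / bucketWidth;
        if (bucketID >= buckets)
            bucketID = buckets - 1;  // clamp out-of-range values into the last bucket
        temp[bucketID]++;
    }

    // Merge into the global histogram atomically to avoid lost updates.
    for (int j = 0; j < buckets; j++)
        if (temp[j] != 0)
            atomicAdd(&histo[j], temp[j]);
}
```

The global histogram must be zeroed before the launch. Because the table is indexed dynamically, the compiler will most likely place it in local memory, so this is a sketch of the idea rather than a fast implementation.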

Is there another way to do it? I don't know how to approach this, and I am confused.

Recommended answer

When using atomics, launching fewer blocks reduces contention (and hence improves performance), because fewer blocks need to coordinate with one another. Launch fewer blocks and have each block loop over more of the input elements.

for (unsigned tid = blockIdx.x*blockDim.x+threadIdx.x; 
              tid < size; tid += gridDim.x*blockDim.x) {
    unsigned char value = array[tid]; // borrowing notation from another answer here
    int bin = value % buckets;
    atomicAdd(&histo[bin],1);
}
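A fuller version of this loop, with a host-side wrapper, might look like the sketch below. It relies only on 32-bit `atomicAdd` in global memory, which is available from compute capability 1.1 (shared-memory atomics need 1.2). Two things differ from the loop above and are assumptions of this sketch: it bins by range (`value / bucketWidth`) as described in the question rather than by `value % buckets`, and the names `histo_global_atomics` and `histogramOnGpu`, the grid size of 32 blocks and the block size of 256 threads are hypothetical choices. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void histo_global_atomics(const unsigned char *buffer, long size,
                                     unsigned int *histo, int buckets, int bucketWidth)
{
    // Grid-stride loop: a small grid walks over the whole input.
    long stride = (long)blockDim.x * gridDim.x;
    for (long tid = (long)blockIdx.x * blockDim.x + threadIdx.x; tid < size; tid += stride) {
        int bin = buffer[tid] / bucketWidth;
        if (bin >= buckets)
            bin = buckets - 1;       // clamp out-of-range values into the last bucket
        atomicAdd(&histo[bin], 1u);  // global-memory atomic
    }
}

// Hypothetical host wrapper: copies the data to the device, runs the kernel,
// and returns one count per bucket.
std::vector<unsigned int> histogramOnGpu(const std::vector<unsigned char> &data,
                                         int buckets, int maxValue)
{
    int bucketWidth = (maxValue + buckets - 1) / buckets;  // e.g. 40 / 4 = 10
    std::vector<unsigned int> result(buckets, 0);
    if (data.empty())
        return result;

    unsigned char *d_buf = nullptr;
    unsigned int *d_histo = nullptr;
    cudaMalloc(&d_buf, data.size());
    cudaMalloc(&d_histo, buckets * sizeof(unsigned int));
    cudaMemcpy(d_buf, data.data(), data.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, buckets * sizeof(unsigned int));

    const int threads = 256;  // arbitrary block size
    const int blocks = 32;    // deliberately small, fixed grid
    histo_global_atomics<<<blocks, threads>>>(d_buf, (long)data.size(), d_histo,
                                              buckets, bucketWidth);

    cudaMemcpy(result.data(), d_histo, buckets * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    cudaFree(d_histo);
    return result;
}
```

For the sample values from the question (1, 10, 30, 39, 32, 2, 4, 5, 1) with 4 buckets and maxValue 40, `histogramOnGpu` should return the counts 5, 1, 0, 3.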
