Cuda programming histogram

Problem description

I want to run a CUDA program, but I am a beginner. I have to write a program for a histogram, but with buckets. Depending on maxValue (40 in the example), each number is added to the appropriate bucket. If we have 4 buckets:

histo: | 1 | 10 | 30 | 39 | 32 | 2 | 4 | 5 | 1 |

0-9 (1st bucket)

10-19 (2nd bucket)

20-29 (3rd bucket)

30-39 (4th bucket)

My GPU has Compute Capability 1.1.

I tried using a shared temp[] per block, with each thread adding its values to that temp table:

__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                               unsigned int *histo )
{
     extern __shared__ unsigned int temp[];
     temp[threadIdx.x] = 0;
     __syncthreads();

     int i = threadIdx.x + blockIdx.x * blockDim.x;
     int offset = blockDim.x * gridDim.x;
     int bucketID;
     while (i < size)
     {
              bucketID = buffer[i]/Bwidth;
              atomicAdd( &temp[bucketID], 1);
              i += offset;
     }
     __syncthreads();


    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}

histo_kernel_optimized5<<<array_size/buckets, buckets, buckets*sizeof(unsigned int)>>>(buffer, SIZE, histogram);

But the compiler says: Instruction '{atom,red}.shared' requires .target sm_12 or higher

I also tried giving each thread its own temp table:

__global__ void histo_kernel_optimized5( unsigned char *buffer, long size,
                               unsigned int *histo )
{
    unsigned int temp[buckets];
     int j;
    for (j=0;j<buckets;j++){
        temp[j]=0;
    }

    int bucketID;

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size)
    {
        bucketID = buffer[i]/Bwidth;
        temp[bucketID]++;
        i += offset;
    }


    for (j=0;j<buckets;j++){
        histo[j] += temp[j];    
    }
 }

But the compiler doesn't let me do this, because it needs a constant to size the temp table. The problem statement, however, says that the number of buckets is given dynamically on the command line.

Is there another way to do it? I don't know how to do it. I am confused.

Solution

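When using atomics, launching fewer blocks reduces contention (and hence improves performance), because fewer blocks have to coordinate their updates to the same histogram entries. Launch fewer blocks and have each block loop over more of the input elements. Atomic operations on 32-bit words in global memory are available from compute capability 1.1, so updating histo directly avoids the shared-memory atomics that require sm_12: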

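for (unsigned tid = blockIdx.x*blockDim.x+threadIdx.x;
              tid < size; tid += gridDim.x*blockDim.x) {
    unsigned char value = array[tid]; // borrowing notation from another answer here
    int bin = value / Bwidth;         // Bwidth is the bucket width from the question (10 in the example)
    atomicAdd(&histo[bin],1);         // global-memory atomic, no shared memory involved
}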
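
To show how the loop above might fit into a complete program, here is a minimal sketch. The names `histo_kernel_global`, `histogram_on_gpu`, and `bucketWidth`, the launch configuration of 16 blocks of 128 threads, and the clamping of out-of-range values to the last bucket are all assumptions made for illustration, not part of the original answer. Error checking of the CUDA calls is left out to keep it short. The number of buckets is an ordinary runtime argument, so it can come from the command line.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread strides over the input and increments the matching bucket
// with a global-memory atomicAdd, so no shared-memory atomics are needed.
__global__ void histo_kernel_global(const unsigned char *buffer, long size,
                                    unsigned int *histo,
                                    unsigned int bucketWidth,
                                    unsigned int buckets)
{
    for (long tid = blockIdx.x * blockDim.x + threadIdx.x;
         tid < size; tid += (long)gridDim.x * blockDim.x) {
        unsigned int bin = buffer[tid] / bucketWidth;
        if (bin >= buckets) bin = buckets - 1;  // assumption: clamp values >= maxValue
        atomicAdd(&histo[bin], 1u);
    }
}

// Hypothetical host helper: copies the data to the device, runs the kernel,
// and returns one count per bucket. CUDA error checking is omitted.
std::vector<unsigned int> histogram_on_gpu(const std::vector<unsigned char> &data,
                                           unsigned int maxValue,
                                           unsigned int buckets)
{
    std::vector<unsigned int> result(buckets, 0);
    if (data.empty() || buckets == 0) return result;

    // Bucket width, rounded up so every value below maxValue gets a valid bucket.
    unsigned int bucketWidth = (maxValue + buckets - 1) / buckets;
    if (bucketWidth == 0) bucketWidth = 1;

    unsigned char *d_buffer = NULL;
    unsigned int *d_histo = NULL;
    cudaMalloc((void **)&d_buffer, data.size());
    cudaMalloc((void **)&d_histo, buckets * sizeof(unsigned int));
    cudaMemcpy(d_buffer, data.data(), data.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, buckets * sizeof(unsigned int));

    // Deliberately small, fixed grid (illustrative values): each block
    // loops over many input elements instead of launching many blocks.
    const int threads = 128;
    const int blocks = 16;
    histo_kernel_global<<<blocks, threads>>>(d_buffer, (long)data.size(),
                                             d_histo, bucketWidth, buckets);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), d_histo, buckets * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_buffer);
    cudaFree(d_histo);
    return result;
}
```

With the input from the question (1, 10, 30, 39, 32, 2, 4, 5, 1), maxValue 40 and 4 buckets, the bucket width is 10, and `histogram_on_gpu` should return the counts 5, 1, 0, 3 for the four buckets.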
