在CUDA中查找矩阵的最大值 [英] Find max of matrix in CUDA

查看:318
本文介绍了在CUDA中查找矩阵的最大值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我刚刚开始在CUDA。现在我有一个问题。
我有N * N矩阵,窗口比例是8x8。我想把这个矩阵细分成多个子矩阵,并找到这个的最大值。
例如,如果我有64 * 64矩阵,所以我将有8小矩阵与8 * 8尺度,找出8最大值。最后,我将所有最大值保存到新数组,但它的顺序总是改变。我想找到解决方案,以保持它们正确的顺序

I just started in CUDA. Now I have a question. I have N*N matrix, and a window scale is 8x8. I want subdivided this matrix into multiple sub-matrix and find max value of this. For example if I have 64*64 matrix so I will have 8 small matrix with 8*8 scale and find out 8 max values. Finally I save all max values into new array, but its order always change. I want find solution to keep them in right order

__global__ void calculate_emax_kernel(float emap[],float emax[], int img_height, int img_width,int windows_size)
{
    int x_index = blockIdx.x*blockDim.x+threadIdx.x;
    int y_index = blockIdx.y*blockDim.y+threadIdx.y;

    int num_row_block = img_height/windows_size;
    int num_col_block = img_width/windows_size;
    __shared__ float window_elements[256];
    __shared__ int counter;
    __shared__ int emax_count;

    if (threadIdx.x == 0) emax_count = 0;
    __syncthreads();
    int index;
    int emax_idx = 0;


    if(y_index >= img_height|| x_index >= img_width) return;
    for(int i = 0; i < num_row_block; i++)
    {
        for(int j = 0; j < num_col_block; j++)
        {
            counter = 0;
            if(y_index >= i*windows_size && y_index < (i+1)*windows_size
                    && x_index >= j*windows_size && x_index < (j+1)*windows_size)
            {
                int idx = y_index*img_height + x_index;
                index = atomicAdd(&counter, 1);

                window_elements[index] = emap[idx];
                __syncthreads();


                // reduction
                unsigned int k = (windows_size*windows_size)/2;
                while(k != 0)
                {
                    if(index < k)
                    {
                        window_elements[index] = fmaxf(window_elements[index], window_elements[index+k]);

                    }
                    k /= 2;
                }
                if(index == 0)
                {
                    emax[i*num_row_block+j] = window_elements[index];
                }
            }
            __syncthreads();
        }
        __syncthreads();
    }
    __syncthreads();
}

这是我的配置

void construct_emax(float *input,float *output, int img_height, int img_width)
{
    int windows_size = 4;
    float * d_input, * d_output;
    cudaMalloc(&d_input, img_width*img_height*sizeof(float));
    cudaMalloc(&d_output, img_width*img_height*sizeof(float));

    cudaMemcpy(d_input, input, img_width*img_height*sizeof(float), cudaMemcpyHostToDevice);
    dim3 blocksize(16,16);
    dim3 gridsize;

    gridsize.x=(img_width+blocksize.x-1)/blocksize.x;
    gridsize.y=(img_height+blocksize.y-1)/blocksize.y;

    calculate_emax_kernel<<<gridsize,blocksize>>>(d_input,d_output,img_height,img_width,windows_size);

}


推荐答案

平行缩小是棘手的; 细分平行缩小更复杂。现在你是在2-D,你的段/窗口小于线程块。

With CUDA, parallel reduction is tricky; segmented parallel reduction is trickier. Now you are doing it in 2-D, and your segment/window is smaller than the thread block.

对于大窗口大小,我不认为它是一个问题。你可以使用一个线程块来减少一个窗口。例如,如果你有一个16x16窗口,你可以简单地使用16x16线程块。如果你有更大的窗口大小,例如64x64,你仍然可以使用16x16线程块。在数据加载期间,首先将64x64窗口缩小为16x16个元素,然后在线程块中减少到1个标量。

For large window size, I don't think it is a problem. You could use one thread block to reduce one window. For example if you have a 16x16 window, you could simply use 16x16 thread block. If you have even larger window size, for example 64x64, you could still use 16x16 thread block. First reduce the 64x64 window to 16x16 elements during data loading, then reduce to 1 scalar within the thread block.

对于小于块大小的窗口,减少每个线程块的多个窗口以提高性能。您可以使用当前的块/网格配置,其中每个256线程块(16x16)负责16个4x4窗口。但这不是最佳的,因为每个32线程包装组织在两个部分(2x16)。这不利于合并全局内存访问,并且很难将2x16翘曲映射到一个或多个4x4窗口,以实现高效的并行缩小。

For window size smaller than the block size, you will have to reduce multiple windows per thread block for higher performance. You could use your current block/grid configuration, where each 256-thread block (16x16) is responsible for 16 4x4 windows. But this will not be optimal because each 32-thread wrap is organized in two parts (2x16). This is not good for coalesced global memory access, and it is hard to map a 2x16 warp to one or more 4x4 windows for efficient parallel reduction.

或者,我建议您使用1-D线程块有256个线程。每个 m 线程减少一个 m x m 然后可以使用2-D网格覆盖整个图像。

Alternatively I would suggest you use 1-D thread block with 256 threads. Every m threads reduce one mxm window. Then you could use 2-D grid to cover the whole image.

const int m = window_size;
dim3 blocksize(256);
dim3 gridsize((img_width+255)/256, (img_height+m-1)/m);

在内核函数中,您可以


  1. 将每个 m x m 减少到1x m 向量;

  2. 使用树缩减方法将1x m 向量减少为标量。

  1. reduce each mxm window to a 1xm vector during global data loading;
  2. use tree reduction method to reduce the 1xm vector to a scalar.

以下代码是一个概念性演示,当 m 2的功率和 m <= 32 。您可以进一步修改它为任意 m 和更好的边界检查。

This following code is a conceptual demo which works when m is a power of 2 and m <= 32. You could further modify it for arbitrary m and better boundary checking.

#include <assert.h>
#include <cuda.h>
#include <thrust/device_vector.h>

__global__ void calculate_emax_kernel(const float* input, float* output,
                                      int height, int width, int win_size,
                                      int out_width) {
  const int tid = threadIdx.x;
  const int i = blockIdx.y * win_size;
  const int j = blockIdx.x * 256 + tid;
  const int win_id = j % win_size;

  __shared__ float smax[256];

  float tmax = -1e20;
  if (j < width) {
    for (int tile = 0; tile < win_size; tile++) {
      if (i + tile < height) {
        tmax = max(tmax, input[(i + tile) * width + j]);
      }
    }
  }
  smax[tid] = tmax;
  for (int shift = win_size / 2; shift > 0; shift /= 2) {
    if (win_id < shift) {
      smax[tid] = max(smax[tid], smax[tid + shift]);
    }
  }
  if (win_id == 0 && j < width) {
    output[blockIdx.y * out_width + (j / win_size)] = smax[tid];
  }
}

int main() {
  const int height = 1024;
  const int width = 1024;
  const int m = 4;
  thrust::device_vector<float> in(height * width);
  thrust::device_vector<float> out(
      ((height + m - 1) / m) * ((width + m - 1) / m));

  dim3 blocksize(256);
  dim3 gridsize((width + 255) / 256, (height + m - 1) / m);

  assert(m == 2 || m == 4 || m == 8 || m == 16 || m == 32);
  calculate_emax_kernel<<<gridsize, blocksize>>>(
      thrust::raw_pointer_cast(in.data()),
      thrust::raw_pointer_cast(out.data()),
      height, width, m, (width + m - 1) / m);

  return 0;
}

这篇关于在CUDA中查找矩阵的最大值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆