cuda convolution mapping


Question



I'm trying to copy, for each thread block, a patch of the image plus its apron into shared memory.

After my data are copied into shared memory (I used a matrix), I want a relation that maps the center of the mask I consider for convolution in shared memory to the center of the mask in the image buffer.

I want that because when I try to convolve the image, the center of the mask in shared memory does not seem to correspond to the center in the image buffer stored in global memory.

In the code below I wrote an example of a simple black-and-white image erosion algorithm; when I write the result of the convolution to the output image, the centers do not seem to correspond.

I wrote my sample using a 512x512 px image.

My settings are:

//block and grid size
dim3 block(16,16);
dim3 grid(512/(block.x),512/(block.y),1);

This is my kernel:

#define STREL_SIZE 5

#define TILE_W 16
#define TILE_H 16

#define R (STREL_SIZE/2)

//size of tile image + apron
#define BLOCK_W (TILE_W+(2*R))
#define BLOCK_H (TILE_H+(2*R))


 __global__ void erode_multiple_img_SM_v2(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){

    // Data cache: threadIdx.x , threadIdx.y
    __shared__ unsigned char data[TILE_W + STREL_SIZE][TILE_H + STREL_SIZE];


     int col = blockIdx.x * blockDim.x + threadIdx.x;
     int row = blockIdx.y * blockDim.y + threadIdx.y;

     // global mem address of this thread
     int gLoc =  row*w +col;


     int x, y;  // image based coordinate



     if((col<w)&&(row<h)) {
         data[threadIdx.x][threadIdx.y]=buffer_in[gLoc];

     if (threadIdx.y > (h-STREL_SIZE))
          data[threadIdx.x][threadIdx.y + STREL_SIZE]=buffer_in[gLoc + STREL_SIZE];

     if (threadIdx.x >(w-STREL_SIZE))
          data[threadIdx.x + STREL_SIZE][threadIdx.y]=buffer_in[gLoc+STREL_SIZE];

     if ((threadIdx.x >(w-STREL_SIZE)) && (threadIdx.y > (h-STREL_SIZE)))
          data[threadIdx.x+STREL_SIZE][threadIdx.y+STREL_SIZE] =     buffer_in[gLoc+2*STREL_SIZE];

     //wait for all threads to finish read
     __syncthreads();

      unsigned char min_value = 255;
      for(x=0;x<STREL_SIZE;x++){
          for(y=0;y<STREL_SIZE;y++){
              min_value = min( (data[threadIdx.x+x][threadIdx.y+y]) , min_value);
              }

          }
      buffer_out[gLoc]= min_value;
      }
}

My input image:

The output of my kernel is:

where w is the width of the image and equals 512, and h is the height of the image and equals 512.

I call the kernel with:

 erode_multiple_img_SM_v2<<<grid,block>>>(dimage_src,dimage_dst,512,512);

dimage_src is the input image, an array buffer (not a matrix), and dimage_dst is the output image buffer.

Each buffer has size nElem * nImg * sizeof(unsigned char), where nElem = 512*512 is the number of pixels per image and nImg is the number of images I want to process, which in my case equals 1. Where am I wrong?

CODE UPDATE:

__global__ void erode_multiple_img_SM_v2(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){

// Data cache: threadIdx.x , threadIdx.y
__shared__ unsigned char data[TILE_W + STREL_SIZE-1 ][TILE_H + STREL_SIZE-1 ];

// global mem address of this thread
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;



int gLoc =  row*w +col;



// each threads loads four values from global memory into shared mem
int x, y;   // image based coordinate



if((col<w)&&(row<h)) {

    data[threadIdx.x][threadIdx.y] = buffer_in[gLoc];

     if (threadIdx.y > (TILE_H-STREL_SIZE+1))
          data[threadIdx.x][threadIdx.y + STREL_SIZE-1]=buffer_in[(row + STREL_SIZE-1)*w + col];

     if (threadIdx.x > (TILE_W-STREL_SIZE+1))
           data[threadIdx.x + STREL_SIZE-1][threadIdx.y] = buffer_in[row*w+col + STREL_SIZE-1];

     if ((threadIdx.x > (TILE_W-STREL_SIZE+1)) && (threadIdx.y > (TILE_H-STREL_SIZE+1)))
           data[threadIdx.x + STREL_SIZE-1][threadIdx.y + STREL_SIZE-1] = buffer_in[(row + STREL_SIZE-1)*w + col + STREL_SIZE-1];

    //wait for all threads to finish read
     __syncthreads();



      unsigned char min_value = 255;
      for(x=0;x<STREL_SIZE;x++){
          for(y=0;y<STREL_SIZE;y++){
              min_value = min( (data[threadIdx.x+x][threadIdx.y+y]) , min_value);
              }

          }
      buffer_out[gLoc]= min_value;
      }

    }

My output now is:

UPDATE 2 (version 2, working):

I implemented another version of the algorithm. To do that I followed a slide deck that I found very useful and well explained, in particular the part in which the author talks about convolution (slide 27).

I changed the block and grid settings to:

dim3 block(20,20);
dim3 grid(512/(block.x)+ block.x,512/(block.y)+block.y);

The kernel call itself remains the same:

erode_multiple_img_SM<<<grid,block>>>(dimage_src,dimage_dst,512,512);

where the kernel arguments are:

  1. dimage_src: buffer of unsigned char with size height x width containing the input image.
  2. dimage_dst: buffer of unsigned char with size height x width containing the output image that my kernel produces.
  3. 512: the third argument is the width of the image.
  4. 512: the fourth argument is the height of the image.

Remember, my sample images are black and white, but this version of the erosion can work with grayscale too.

Here is my working kernel:

#define STREL_W 5
#define STREL_H 5

#define STREL_SIZE 5


#define TILE_W 16
#define TILE_H 16

#define R (STREL_SIZE/2)


#define BLOCK_W (TILE_W+(2*R))
#define BLOCK_H (TILE_H+(2*R))

__global__ void erode_multiple_img_working(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){


    __shared__ unsigned char fast_acc_mat[BLOCK_W][BLOCK_H];

    int ty = threadIdx.y;
    int tx = threadIdx.x;


    int row_o = blockIdx.y * TILE_W + ty;
    int col_o = blockIdx.x * TILE_H + tx;


    int row_i = row_o - R;
    int col_i = col_o - R;

    //in of img size
    if((row_i >= 0) && (row_i < h) && (col_i >= 0) && (col_i < w) ){

        fast_acc_mat[ty][tx] = buffer_in[ row_i * w + col_i];

    }
    else{

        fast_acc_mat[ty][tx] = 0;

    }


    __syncthreads();





    if( ty < TILE_H && tx < TILE_W ){

        unsigned char min_val=255;
        for(int i = 0; i < STREL_SIZE; i++) {
            for(int j = 0; j < STREL_SIZE; j++) {

                min_val = min( fast_acc_mat[i+ty][j+tx] , min_val );

            }
        }
        if(row_o < h && col_o < w)
                buffer_out[row_o * w + col_o] = min_val;

        }

     }

And this is my eroded image (output):

I made a diagram that shows how the part of the algorithm described by Eric loads the pixels of a TILE into shared memory:

Solution

You need only [20][20] shared mem, rather than [21][21]. It should be changed to

__shared__ unsigned char data[TILE_W + STREL_SIZE-1][TILE_H + STREL_SIZE-1];

Another problem is the data loading. The correct way is to read (16+4) x (16+4) pixels from the input into shared memory, using the (16 x 16) threads collaboratively. This can be divided into 4 parts:

1) first part: threads (0:15, 0:15) load pixels (0:15, 0:15)

2) second part: threads (0:15, 12:15) load pixels (0:15, 16:19)

3) third part: threads (12:15, 0:15) load pixels (16:19, 0:15)

4) fourth part: threads (12:15, 12:15) load pixels (16:19, 16:19)

But in your code you are messing up the indexing. For parts 2 to 4, only some of the threads in the thread block will be working, and additional boundary checking is also required.

For the 2nd part, you should use thread(0:15, 12:15) to read pixel(0:15, 16:19) as

 if (threadIdx.y > (TILE_H-STREL_SIZE))
      data[threadIdx.x][threadIdx.y + STREL_SIZE-1] = row + STREL_SIZE-1<h ? buffer_in[(row + STREL_SIZE-1)*w + col] : 0;

The 3rd and the 4th part require similar modifications as

 if (threadIdx.x > (TILE_W-STREL_SIZE))
      data[threadIdx.x + STREL_SIZE-1][threadIdx.y] = col + STREL_SIZE-1<w ? buffer_in[row*w+col + STREL_SIZE-1] : 0;

 if ((threadIdx.x > (TILE_W-STREL_SIZE)) && (threadIdx.y > (TILE_H-STREL_SIZE)))
      data[threadIdx.x + STREL_SIZE-1][threadIdx.y + STREL_SIZE-1] = (row + STREL_SIZE-1<h && col + STREL_SIZE-1<w) ? buffer_in[(row + STREL_SIZE-1)*w + col + STREL_SIZE-1] : 0;

Then you should be able to get the correct result image, although there will be a 2x2 pixel shift, because you do the convolution on (0...4, 0...4) rather than (-2...2, -2...2).

For more details, you could read

http://igm.univ-mlv.fr/~biri/Enseignement/MII2/Donnees/convolutionSeparable.pdf

https://www.evl.uic.edu/sjames/cs525/final.html
