Upload data in shared memory for convolution kernel


Problem description

I am having some difficulties understanding the batch loading referred to in the comments. In order to compute the convolution at a pixel, the mask, whose size is 5, must be centered on that specific pixel. The image is divided into tiles. These tiles, after applying the convolution mask, are the final output tiles of size TILE_WIDTH*TILE_WIDTH. For the pixels that belong to the border of the output tile, the mask must borrow some pixels from the neighboring tile; when the tile lies on the border of the image, those borrowed values are assigned zero instead. These two steps are depicted in:

if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
    N_ds[destY][destX] = I[src];
else
    N_ds[destY][destX] = 0;

For that reason the shared memory array has dimension TILE_WIDTH + Mask_width - 1 on each side. The following parts of the code are unclear to me.


  1. The destY and destX indices. What does dividing the flattened output index by the input tile width mean?
  2. The srcY and srcX indices. Why do the destY and destX indices take part in the srcY and srcX indices?

srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;

srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;

EDIT: Image added. In green there are the output tiles and in red we have the mask centered on index 114. It is obvious that the mask borrows elements from different tiles. Finally, this image refers to one channel.

Example: Based on the image below I have tried to work out an example. The output tile has blockIdx.x = 1 and blockIdx.y = 1, and based on that destY = 0 and destX = 0. Also, srcY = 1*6 + 0 - 3 = 3, srcX = 3 and src = (3*18 + 3)*3 + 0 = 171. Based on the calculations and the image example we do not have a match. In the first shared memory position, the value that should be stored is the one with global index 57. What is wrong with the above calculations? Can anyone help, please?
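For reference, the same arithmetic spelled out with the kernel's src formula (my own side calculation; note that the edit above states the picture refers to a single channel):

int srcY = 3, srcX = 3, width = 18, k = 0;
int src3 = (srcY * width + srcX) * 3 + k;   // = 171, the value computed above with channels = 3
int src1 = (srcY * width + srcX) * 1 + k;   // = 57,  the global index expected from the picture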


#define Mask_width  5
#define Mask_radius Mask_width/2              // integer division: 2
#define TILE_WIDTH 16
#define w (TILE_WIDTH + Mask_width - 1)       // width of the shared tile including the halo
#define clamp(x) (min(max((x), 0.0), 1.0))

__global__ void convolution(float *I, const float* __restrict__ M, float *P,
                            int channels, int width, int height) {
   __shared__ float N_ds[w][w];
   int k;
   for (k = 0; k < channels; k++) {
      // First batch loading
      int dest = threadIdx.y * TILE_WIDTH + threadIdx.x,
         destY = dest / w, destX = dest % w,
         srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius,
         srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius,
         src = (srcY * width + srcX) * channels + k;
      if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
         N_ds[destY][destX] = I[src];
      else
         N_ds[destY][destX] = 0;

      // Second batch loading
      dest = threadIdx.y * TILE_WIDTH + threadIdx.x + TILE_WIDTH * TILE_WIDTH;
      destY = dest / w, destX = dest % w;
      srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
      srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
      src = (srcY * width + srcX) * channels + k;
      if (destY < w) {
         if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
            N_ds[destY][destX] = I[src];
         else
            N_ds[destY][destX] = 0;
      }
      __syncthreads();

      float accum = 0;
      int y, x;
      for (y = 0; y < Mask_width; y++)
         for (x = 0; x < Mask_width; x++)
            accum += N_ds[threadIdx.y + y][threadIdx.x + x] * M[y * Mask_width + x];
      y = blockIdx.y * TILE_WIDTH + threadIdx.y;
      x = blockIdx.x * TILE_WIDTH + threadIdx.x;
      if (y < height && x < width)
         P[(y * width + x) * channels + k] = clamp(accum);
      __syncthreads();
   }
}
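
For completeness, a host-side launch of this kernel presumably looks like the following minimal sketch (the grid/block choice and the name convolution_launch are my own assumptions, not part of the original post):

// Assumed host-side launcher: one thread per output pixel,
// TILE_WIDTH x TILE_WIDTH threads per block.
void convolution_launch(float *d_I, const float *d_M, float *d_P,
                        int channels, int width, int height) {
   dim3 block(TILE_WIDTH, TILE_WIDTH);
   dim3 grid((width  + TILE_WIDTH - 1) / TILE_WIDTH,
             (height + TILE_WIDTH - 1) / TILE_WIDTH);
   convolution<<<grid, block>>>(d_I, d_M, d_P, channels, width, height);
}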


Answer

Your question is similar in concept to my first question on StackOverflow: Moving a (BS_X+1)*(BS_Y+1) global memory matrix by BS_X*BS_Y threads.

You are facing the following problem: each thread block of size TILE_WIDTH x TILE_WIDTH should fill a shared memory area of size (TILE_WIDTH + Mask_width - 1) x (TILE_WIDTH + Mask_width - 1).


4) Generally, what is the intuitive explanation of having two loadings?

Since the shared memory area (TILE_WIDTH + Mask_width - 1) x (TILE_WIDTH + Mask_width - 1) is larger than the block size TILE_WIDTH x TILE_WIDTH, and assuming it is smaller than 2 x TILE_WIDTH x TILE_WIDTH, each thread needs to move at most two elements from global memory to shared memory. This is the reason why you have a two-stage loading.
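
With the concrete values from the question this works out as follows (my own arithmetic, just to make the inequality explicit):

// TILE_WIDTH = 16, Mask_width = 5, so w = 20:
int blockThreads = TILE_WIDTH * TILE_WIDTH;   // 16 * 16 = 256 threads per block
int sharedElems  = w * w;                     // 20 * 20 = 400 shared elements
// 256 < 400 <= 2 * 256 = 512, so two loading stages are enough: every thread
// loads one element in the first stage and at most one more in the second.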


1) The destY and destX indices. What does dividing the output index by the input tile width mean?

This concerns the first load stage, which is in charge of loading TILE_WIDTH x TILE_WIDTH elements from global memory and fills the uppermost part of the shared memory area.



So, the operation

dest = threadIdx.y * TILE_WIDTH + threadIdx.x;

flattens the 2D coordinates of the generic thread, while

destX = dest % w;
destY = dest / w; 

performs the inverse operation, in that it calculates the 2D coordinates of the generic thread with respect to the shared memory area.
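
A small worked example of this flatten/unflatten mapping, using the question's TILE_WIDTH = 16 and w = 20 (my own illustration):

// Flatten thread (x = 4, y = 1), then unflatten against the padded width w = 20:
int dest  = 1 * TILE_WIDTH + 4;   // 1*16 + 4 = 20
int destY = dest / w;             // 20 / 20  = 1  -> second row of the shared tile
int destX = dest % w;             // 20 % 20  = 0  -> wraps back to the first column
// The last thread, (x = 15, y = 15), gives dest = 255 -> (destY, destX) = (12, 15),
// so the first stage fills shared rows 0..12 (row 12 only partially).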


2) The srcY and srcX indices. Why do the destY and destX indices take part in the srcY and srcX indices?



srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;

srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;

(blockIdx.x * TILE_WIDTH, blockIdx.y * TILE_WIDTH) would be the coordinates of the global memory location if the block size and the shared memory size were the same. Since you are borrowing memory values from the neighboring tiles as well, you have to shift the above coordinates by (destX - Mask_radius, destY - Mask_radius).
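
As an illustration (my own numbers, reusing TILE_WIDTH = 16 and Mask_radius = 2), for block (blockIdx.x, blockIdx.y) = (1, 1) the thread that ends up with (destX, destY) = (0, 0) reads from:

int srcY = 1 * TILE_WIDTH + 0 - Mask_radius;   // 16 - 2 = 14
int srcX = 1 * TILE_WIDTH + 0 - Mask_radius;   // 16 - 2 = 14
// i.e. the top-left corner of the halo, two rows and two columns
// above and to the left of the block's own 16 x 16 output tile.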


3) Why in the second loading do we use the offset TILE_WIDTH * TILE_WIDTH?

You have this offset because in the first loading stage you have already filled the "first" TILE_WIDTH x TILE_WIDTH locations of the shared memory.
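
Concretely (again my own arithmetic for TILE_WIDTH = 16 and w = 20), the offset shifts the flattened index into the part of the shared tile that the first stage did not reach:

// Second stage: dest runs from 256 (thread (0,0)) up to 511 (thread (15,15)).
int dest  = 0 * TILE_WIDTH + 0 + TILE_WIDTH * TILE_WIDTH;  // 256
int destY = dest / w;                                      // 256 / 20 = 12
int destX = dest % w;                                      // 256 % 20 = 16
// This continues exactly where the first stage stopped. The guard "if (destY < w)"
// discards threads whose dest >= 400 (= 20 * 20), which would fall outside the array.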

EDIT

The picture below illustrates the correspondence between the flattened thread index dest and the shared memory locations. In the picture, the blue boxes represent the elements of the generic tile, while the red boxes represent the elements of the neighboring tiles. The union of the blue and red boxes corresponds to the overall shared memory locations. As you can see, all 256 threads of a thread block are involved in filling the upper part of the shared memory above the green line, while only 145 are involved in filling the lower part of the shared memory below the green line. Now you should also understand the TILE_WIDTH x TILE_WIDTH offset.

Please notice that you have at most 2 memory loads per thread due to the particular choice of your parameters. For example, if you have TILE_WIDTH = 8, then the number of threads in a thread block is 64, while the shared memory size is 12x12 = 144, which means that each thread is in charge of performing at least 2 shared memory writes, since 144/64 = 2.25.
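
A small helper makes this relationship explicit (my own sketch; the kernel above hard-codes exactly two stages, which covers the TILE_WIDTH = 16 / Mask_width = 5 combination):

// Number of loading stages needed so that TILE_WIDTH * TILE_WIDTH threads
// can cover a (TILE_WIDTH + Mask_width - 1)^2 shared tile.
int loading_stages(int tileWidth, int maskWidth) {
   int threads = tileWidth * tileWidth;
   int shared  = (tileWidth + maskWidth - 1) * (tileWidth + maskWidth - 1);
   return (shared + threads - 1) / threads;   // ceiling division
}
// loading_stages(16, 5) == 2   (400 elements / 256 threads)
// loading_stages( 8, 5) == 3   (144 elements /  64 threads)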
