CUDA estimating threads per blocks and block numbers for 2D grid data


Question



Let me start by saying that I've read carefully all similar questions on SO:

  1. Determining threads per block and block per grid
  2. Threads per SM, threads per block
  3. CUDA Blocks and Threads
  4. Warps and optimal number of blocks

My intention is to try and calculate dynamically (rather than hardcoding values) for a feed-forward neural net library I am developing.

My data is not a square lattice (a matrix), as it is in most examples I've seen; it is instead two vectors producing a matrix, with unequal rows and columns:

#include <thrust/device_vector.h>

float x[6] {1.f, 1.f, 0.f, 1.f, 1.f, 0.f}; 
thrust::device_vector<float> in_vec( x, x+6 );
float y[9] {1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f};
thrust::device_vector<float> w_vec( y, y+9 );
thrust::device_vector<float> o_wec(9);
thrust::device_vector<float> mtx_vec( 9 * 6 );

float * i_ptr = thrust::raw_pointer_cast( in_vec.data() );
float * w_ptr = thrust::raw_pointer_cast( w_vec.data() );
float * out_ptr = thrust::raw_pointer_cast( mtx_vec.data() );

dim3 threadsPerBlock(9,6);
dim3 numBlocks(1,1);
prop_mtx<<<numBlocks,threadsPerBlock>>>( w_ptr, i_ptr, out_ptr, 6 );

and the kernel:

__global__ void prop_mtx( float * w, float * i, float * o, int s ) 
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // index into the weight vector
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // index into the input vector
    o[y + x * s] = w[x] * i[y];                     // outer product, stored with column stride s
}

The reason why I've taken this approach is because it makes sense in ANN computation, when it comes to vector/matrix calculations. I'd like to keep this consistent, and AFAIK using a 2D grid for Weight * Input calculations is reasonable.

I have to compute my threads per block as a 2D arrangement, with unequal numbers of threads along the two grid dimensions.

I am using a GTX 660, which has:

  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 2047 MBytes 
  ( 5) Multiprocessors, (192) CUDA Cores/MP:     960 CUDA Cores
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
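
Since the aim is to compute these values dynamically rather than hardcode them, note that the same figures can be read at run time through cudaGetDeviceProperties; a minimal sketch:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // device 0
    int warp  = prop.warpSize;                // 32 on this GTX 660
    int maxTB = prop.maxThreadsPerBlock;      // 1024
    int nSMs  = prop.multiProcessorCount;     // 5
    // prop.maxThreadsDim[] and prop.maxGridSize[] hold the per-block
    // and per-grid dimension limits listed above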

I am trying to understand how I can deduce/compute the grid size, threads per block, and number of blocks.

Let us assume I have a weight vector of 800 items, and an input vector of 6500 items.

  1. Does this imply that what I really need is a 2D grid of 800 x 6500? As far as I understand, anything else will produce incorrect results?

I know my maximum threads per block is 1024, but because it's a 2D grid, it would more likely be:

dim3 threadPerBlock(X,Y);

  2. Due to the fact that my grid is not a square matrix, do I need to calculate the X, Y threads per block in a different way?

  3. Or do I need to deduce the number of blocks needed first?

Finally, since my thread warp size is 32,

  4. Does the minimum grid size, regardless of all other parameters, need to be at least 32, or a multiple of 32? Do I need at least 32 threads per block, or a grid size where the smallest number is 32?

Any pseudo-code, or explanation of how I should go about this, would be greatly appreciated.

What I have tried is to calculate my 2D grid size by dividing my data by the warp size of 32. Then I considered calculating the grid threads by using the available SMs. For example:

800 weights / 5 SMs = 160 x's per SM
6500 inputs / 5 SMs = 1300 y's per SM

But I didn't know what to do from there on. Finally, I considered finding the input-weight ratio first:

6500/800 = 8.125

Implying that if I use the minimum size of 32 for X, then Y would have to be 8.125 * 32 = 260. Hence, my threadsPerBlock would be:

dim3 threadsPerBlock(32,260);

That is, of course, 8320 threads per block, which far exceeds the limit of 1024 per block.

So this is my issue: how do I not exceed the 1024 threads per block, whilst retaining the correct grid size of my data?

PS: My question is not about optimising the code, but understanding how to distribute the threads and grid data over the device.

Solution

One approach to categorizing computation problems is to discuss transformations and reductions.

A reduction is a category of problem which takes a large input data set size, and produces a small output data set size. For example, taking an image and finding the maximum pixel value would be a reduction. For this discussion, we will ignore reductions.

A transformation is a category of computation where the output data set size (number of elements) is either "large" or "approximately the same" as the input data set size. For example, taking an image and producing a blurred image would be a transformation.

For transformations, a common approach ("thread strategy") to writing a cuda kernel (the thread code) will be to make one unique thread responsible for each point in the output array. Therefore, the total minimum number of threads that I must have is equal to the size of my output array. The thread code is just the set of computations needed on the input data, in order to produce one output data point. Roughly speaking then, your problem, and simplified kernel, fit this definition; it is a transformation.

Following the above thread strategy, we will need a total number of threads in our grid equal to the total number of output points I need to create. For 2D problems, it is often convenient to think about these two-dimensionally, and CUDA provides 2D (or 3D) threadblock organization and 2D (or 3D) grid organization, for this purpose.

Choice of CUDA threadblock dimensions is often somewhat arbitrary. Generally speaking, we typically want to aim for threadblocks in the 128 - 512 threads per block range (for reasons that are covered elsewhere) and we want threadblocks that are whole-number multiples of 32 (the warp size) for efficiency when the threadblock gets subdivided into warps, which are the actual unit of CUDA execution. On currently supported GPUs, threadblocks are limited to 1024 threads per block (total - i.e. the product of the dimensions). However, for many problems, threadblock choices within this range (e.g. 256 threads vs. 512 threads) often have relatively little impact on performance. In the interest of getting something working, we don't sweat the details at this point. (When you're coming back for optimization, you may revisit this choice.)

So far we've learned that for this problem type, we need a total number of threads to cover our problem space, and we will have a somewhat arbitrary threadblock dimension choice. So let's choose (32,16) (x,y) to start with, for a total of 512 threads. There are no rules that state that threadblocks need be "square", or that grids need be "square", or that there should even be any sort of ratiometric parity between threadblock dimensions and problem size (or grid dimensions).

Now that we have a threadblock choice of (32,16) in mind, we must ask ourselves "how many of these do I need?". This problem is 2D and so we've chosen a 2D threadblock for simplicity of index generation in the thread code. Let's choose a 2D grid as well - it makes sense for a 2D problem, and again for 2D simplicity of index generation. So we can consider the two dimensions independently.

So, how many blocks do I need in the x-direction? I need at least as many as (my problem size in x)/(my threadblock size in x). Since we are dealing with all integers here, this begs the question "what if my problem size is not evenly divisible by my threadblock size?" The canonical solution is to launch more than enough threads to cover the space, or enough blocks to cover the space. But in the non-evenly-divisible case, this will result in "extra threads". We'll discuss and deal with these shortly. Therefore, if I have a dim3 variable like this for threadblock dimensions:

    #define BX 32
    #define BY 16   
    ...
    dim3 block(BX,BY);

then I might construct my dim3 grid variable like this:

    #define DX 800
    #define DY 6500
    ...
    dim3 grid((DX+block.x-1)/block.x, (DY+block.y-1)/block.y);

If you work through this arithmetic, you will see that this causes us to launch enough blocks in the x and y directions, so that we will have at least enough threads to cover our problem space of (DX,DY), one thread per output point.
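
Plugging this question's numbers into that arithmetic gives, for example:

    // grid.x = (800  + 31) / 32 = 25    ->  25 * 32 = 800   (exact fit in x)
    // grid.y = (6500 + 15) / 16 = 407   -> 407 * 16 = 6512  (12 extra rows in y)
    dim3 grid(25, 407);  // 10,175 blocks of 512 threads = 5,209,600 threads,
                         // vs. 800 * 6500 = 5,200,000 output points

The 9,600 surplus threads are the ones the thread check described below will turn into no-ops.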

Hopefully it is clear that the y dimension is treated separately and independently from the x dimension.

The above calculations will usually result in the generation of "too many" threads in my grid. I will have some "extra threads" beyond the end of my problem space (DX, DY) that I need to handle. We want these threads to "do nothing". The canonical way to handle this, is to pass the problem space dimensions to my kernel, create an appropriate globally unique thread index in my kernel, then compare that index to the maximum index in my problem space. If it exceeds it, we simply have that thread skip all remaining thread code.

Using your kernel as an example, it might look like this:

__global__ void prop_mtx( float * w, float * i, float * o, int s, const size_t d_size_x, const size_t d_size_y ) 
{
    int x = blockIdx.x * blockDim.x + threadIdx.x; 
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if ((x < d_size_x) && (y < d_size_y))  // thread check
      o[y + x * s] = w[x] * i[y];
}
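
A matching host-side launch, reusing the block and grid variables built above, might look like this (assuming, by analogy with the original call where s was the input length of 6, that s is now DY, and that the pointers refer to 800-weight, 6500-input, and 800*6500-element output buffers):

    prop_mtx<<<grid, block>>>(w_ptr, i_ptr, out_ptr, DY, DX, DY);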

Note that such a thread check will create threads (in some blocks) that are "not participating" in the subsequent code. A point to be aware of here is that the usage of __syncthreads() depends on all threads in a block participating. Therefore, we should not use __syncthreads() directly in such a case. Instead, we have to condition threadblock behavior appropriately:

__global__ void prop_mtx( float * w, float * i, float * o, int s, const size_t d_size_x, const size_t d_size_y ) 
{
    int x = blockIdx.x * blockDim.x + threadIdx.x; 
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if ((x < d_size_x) && (y < d_size_y))  // thread check
      {
         o[y + x * s] = w[x] * i[y];
         // and other code not dependent on __syncthreads()
       }
     // now it is safe to use since all threads are participating
     __syncthreads();
    if ((x < d_size_x) && (y < d_size_y))  // thread check
      {
          // rest of kernel code
       }
}

Note that it is possible to have a smaller number of threads perform the necessary computations for a larger number of output data points. The 1:1 correspondence between threads and output data is an easy way to think about and write the cuda kernel code, but it's not the only way. One other possible method would be to use some form of a grid-striding loop, so that a smaller grid can cover a larger problem space. Discussion of those strategies is outside the scope of this answer, and the basic methodology discussed in this answer should be understood before tackling other approaches.
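
For reference, a minimal sketch of what such a grid-stride variant of this kernel could look like (a hypothetical prop_mtx_gs; the thread checks become loop conditions, so a fixed-size grid can cover any (d_size_x, d_size_y)):

__global__ void prop_mtx_gs( float * w, float * i, float * o, int s, const size_t d_size_x, const size_t d_size_y )
{
    // each thread starts at its globally unique (x,y) and strides by the
    // total number of threads in each grid dimension, so it may compute
    // more than one output point
    for (int x = blockIdx.x * blockDim.x + threadIdx.x; x < d_size_x; x += gridDim.x * blockDim.x)
        for (int y = blockIdx.y * blockDim.y + threadIdx.y; y < d_size_y; y += gridDim.y * blockDim.y)
            o[y + x * s] = w[x] * i[y];
}

With this form, the grid dimensions can be chosen for machine utilization (e.g. a multiple of the SM count) rather than to match the problem size.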
