Determining the optimal value for #pragma unroll N in CUDA


Problem description

I understand how #pragma unroll works, but if I have the following example:

__global__ void
test_kernel( const float* B, const float* C, float* A_out)
{
  int j = threadIdx.x + blockIdx.x * blockDim.x;
  if (j < array_size) {
     #pragma unroll
     for (int i = 0; i < LIMIT; i++) {
       A_out[i] = B[i] + C[i];
     }
  }
}

I want to determine the optimal value for LIMIT in the kernel above, which will be launched with x number of threads and y number of blocks. LIMIT can be anywhere from 2 to 1<<20. Since 1 million seems like a very big number for the variable (1 million loop iterations unrolled will cause register pressure, and I am not sure the compiler will perform that unroll), what is a "fair" number, if any? And how do I determine that limit?

Answer

Your example kernel is completely serial and not in any way a useful real-world use case for loop unrolling, but let's restrict ourselves to the question of how much loop unrolling the compiler will perform.

Here is a compilable version of your kernel with a bit of template decoration:

template<int LIMIT>
__global__ void
test_kernel( const float* B, const float* C, float* A_out, int array_size)
{
  int j = threadIdx.x + blockIdx.x * blockDim.x;
  if (j < array_size) {
     #pragma unroll
     for (int i = 0; i < LIMIT; i++) {
       A_out[i] = B[i] + C[i];
     }
  }
}

template __global__ void test_kernel<4>(const float*, const float*, float*, int);
template __global__ void test_kernel<64>(const float*, const float*, float*, int);
template __global__ void test_kernel<256>(const float*, const float*, float*, int);
template __global__ void test_kernel<1024>(const float*, const float*, float*, int);
template __global__ void test_kernel<4096>(const float*, const float*, float*, int);
template __global__ void test_kernel<8192>(const float*, const float*, float*, int);

You can compile this to PTX and see for yourself that (at least with the CUDA 7 release compiler and the default compute capability 2.0 target architecture) the kernels with up to LIMIT=4096 are fully unrolled. The LIMIT=8192 case is not unrolled. If you have more patience than I do, you can probably play around with the templating to find the exact compiler limit for this code, although I doubt it is particularly instructive to know.
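As a sketch of that check (assuming the templated code above is saved as test_kernel.cu and a CUDA toolkit is on the PATH; the file name and sm_20 target are illustrative, matching the CUDA 7 default mentioned above):

```shell
# Emit PTX only, without building a full binary.
nvcc -arch=sm_20 -ptx test_kernel.cu -o test_kernel.ptx

# A loop that survives in the PTX keeps a backward branch (bra);
# a fully unrolled instantiation instead repeats the body, so it
# contains many add.f32 instructions and no loop branch.
grep -n "bra" test_kernel.ptx
grep -c "add.f32" test_kernel.ptx
```

Inspecting the PTX per kernel instantiation (each mangled test_kernel<N> symbol) shows which LIMIT values were unrolled.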

You can also see for yourself from the compiler output that all of the heavily unrolled versions use the same number of registers (because of the trivial nature of your kernel).
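One way to get that register count is to ask ptxas for a resource usage report (again assuming the source is saved as test_kernel.cu):

```shell
# -Xptxas -v makes ptxas print registers, shared memory, and
# spill counts for every kernel instantiation it assembles.
nvcc -arch=sm_20 -Xptxas -v -c test_kernel.cu -o test_kernel.o
```

The report lists one "Used N registers" line per template instantiation, so the different LIMIT versions can be compared directly.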
