Determining the optimal value for #pragma unroll N in CUDA
Question
I understand how #pragma unroll works, but if I have the following example:
__global__ void
test_kernel(const float* B, const float* C, float* A_out)
{
    int j = threadIdx.x + blockIdx.x * blockDim.x;
    if (j < array_size) {
        #pragma unroll
        for (int i = 0; i < LIMIT; i++) {
            A_out[i] = B[i] + C[i];
        }
    }
}
I want to determine the optimal value for LIMIT in the kernel above, which will be launched with x threads and y blocks. LIMIT can be anywhere from 2 to 1<<20. Since 1 million seems like a very big number for that variable (unrolling a million loop iterations will cause register pressure, and I am not sure the compiler will perform that unroll), what is a "fair" number, if any? And how do I determine that limit?
Answer
Your example kernel is completely serial and not in any way a useful real-world use case for loop unrolling, but let's restrict ourselves to the question of how much loop unrolling the compiler will perform.
Here is a compilable version of your kernel with a bit of template decoration:
template<int LIMIT>
__global__ void
test_kernel(const float* B, const float* C, float* A_out, int array_size)
{
    int j = threadIdx.x + blockIdx.x * blockDim.x;
    if (j < array_size) {
        #pragma unroll
        for (int i = 0; i < LIMIT; i++) {
            A_out[i] = B[i] + C[i];
        }
    }
}
template __global__ void test_kernel<4>(const float*, const float*, float*, int);
template __global__ void test_kernel<64>(const float*, const float*, float*, int);
template __global__ void test_kernel<256>(const float*, const float*, float*, int);
template __global__ void test_kernel<1024>(const float*, const float*, float*, int);
template __global__ void test_kernel<4096>(const float*, const float*, float*, int);
template __global__ void test_kernel<8192>(const float*, const float*, float*, int);
You can compile this to PTX and see for yourself that (at least with the CUDA 7 release compiler and the default compute capability 2.0 target architecture) the kernels with up to LIMIT=4096 are fully unrolled. The LIMIT=8192 case is not unrolled. If you have more patience than I do, you can play around with the templating to find the exact compiler limit for this code, although I doubt that knowing it would be particularly instructive.
You can also verify from the compiler output that all of the heavily unrolled versions use the same number of registers (because of the trivial nature of your kernel).
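One way to check register usage (a sketch, again assuming the source is saved as unroll.cu) is to ask ptxas for verbose resource statistics:

```shell
# -Xptxas -v makes ptxas report per-kernel resource usage, e.g.
# "ptxas info : Used N registers, ..." for each instantiation.
nvcc -cubin -Xptxas -v unroll.cu 2>&1 | grep -i registers
```

Comparing the "Used N registers" lines across the template instantiations shows whether unrolling is actually increasing register pressure for your kernel.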