Why isn't NVRTC optimizing out my integer division and modulo operations?
Question
I compiled a kernel in NVRTC:
__global__ void kernel_A(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx / 32;
    unsigned char lane_id = idx % 32;
    /* ... */
}
I know integer division and modulo are very costly on CUDA GPUs. However, I assumed that division by a power of 2 like this would be optimized into bit operations, until I compared it against a version with the bit operations written out by hand and found that it apparently isn't:
__global__ void kernel_B(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx >> 5;
    unsigned char lane_id = idx & 31;
    /* ... */
}
kernel_B simply runs faster. With all other code in the kernel omitted, and launching 1024 blocks of 1024 threads each, nvprof shows kernel_A taking 15.2 us on average, while kernel_B takes 7.4 us on average. I suspect NVRTC did not optimize out the integer division and modulo.
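The substitution the question expects is arithmetically valid: for unsigned operands, dividing by 32 and shifting right by 5 give the same result, as do taking the remainder modulo 32 and masking with 31. The following host-side C++ sketch checks this for every thread index in a block. It is not part of the original post; the names WarpLane, split_div_mod, split_shift_mask and forms_agree are made up for illustration.

```cpp
#include <cstdint>

// Illustrative type, not from the original post.
struct WarpLane {
    std::uint8_t warp_id;
    std::uint8_t lane_id;
};

// Mirrors kernel_A: division and modulo by 32.
WarpLane split_div_mod(std::uint16_t idx) {
    return {static_cast<std::uint8_t>(idx / 32),
            static_cast<std::uint8_t>(idx % 32)};
}

// Mirrors kernel_B: shift right by 5 and mask with 31.
WarpLane split_shift_mask(std::uint16_t idx) {
    return {static_cast<std::uint8_t>(idx >> 5),
            static_cast<std::uint8_t>(idx & 31)};
}

// Returns true if both forms agree for every thread index in [0, block_size).
bool forms_agree(unsigned block_size) {
    for (unsigned i = 0; i < block_size; ++i) {
        const auto idx = static_cast<std::uint16_t>(i);
        const WarpLane a = split_div_mod(idx);
        const WarpLane b = split_shift_mask(idx);
        if (a.warp_id != b.warp_id || a.lane_id != b.lane_id) return false;
    }
    return true;
}
```

Calling forms_agree(1024) covers the block size used in the measurement above. Since the two forms compute identical values, a runtime difference between the kernels points at how the code was compiled rather than at the arithmetic.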
The results were obtained on a GeForce 750 Ti with CUDA 8.0, averaged over 100 calls. The compiler option given to nvrtcCompileProgram() is -arch compute_50.
Is this expected?
Recommended answer
I did a thorough bug sweep of the codebase. It turns out my app was built in DEBUG mode, which causes the additional flags -G and -lineinfo to be passed to nvrtcCompileProgram().
From the nvcc man page:
--device-debug (-G)
Generate debug information for device code. Turns off all optimizations. Don't use for profiling; use -lineinfo instead.
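One way to avoid this trap is to keep -G out of any build that will be profiled and pass only the line-info flag there. Below is a minimal sketch, assuming a hypothetical helper build_nvrtc_options() and a debug_build flag that are not part of the original code. It uses the long-form spellings of the options (--gpu-architecture for -arch, --device-debug for -G, --generate-line-info for -lineinfo) and only assembles the option list; the actual call to nvrtcCompileProgram() is left out.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): assembles the option
// strings that would later be handed to nvrtcCompileProgram().
std::vector<std::string> build_nvrtc_options(bool debug_build) {
    std::vector<std::string> opts = {"--gpu-architecture=compute_50"};
    if (debug_build) {
        // -G: full device debug info, disables all optimizations.
        opts.push_back("--device-debug");
    } else {
        // -lineinfo: line tables for profilers, optimizations stay on.
        opts.push_back("--generate-line-info");
    }
    return opts;
}

// Converts the strings to the const char* array form that
// nvrtcCompileProgram() takes. The returned pointers are only valid
// while `opts` is alive and unmodified.
std::vector<const char*> as_c_strings(const std::vector<std::string>& opts) {
    std::vector<const char*> out;
    out.reserve(opts.size());
    for (const std::string& o : opts) out.push_back(o.c_str());
    return out;
}
```

With this split, timings such as the nvprof numbers in the question would be taken from the non-debug option set, where the division and modulo can be optimized.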