Why isn't NVRTC optimizing out my integer division and modulo operations?

Views: 97 | Posted: 2021/6/12 20:50:56 | Tags: optimization cuda compiler-flags nvrtc


Problem description

I compiled a kernel with NVRTC:

__global__ void kernel_A(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx / 32;
    unsigned char lane_id = idx % 32;
    /* ... */
}

I know integer division and modulo are very costly on CUDA GPUs. However, I assumed that division by a power of two would be optimized into bit operations, until I compared it against a hand-written shift-and-mask version and found that it apparently is not:

__global__ void kernel_B(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx >> 5;
    unsigned char lane_id = idx & 31;
    /* ... */
}
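
As a sanity check on the premise, the two forms should agree for every index a 1024-thread block can produce, which is what would make the replacement a legal strength reduction. Below is a minimal host-side C++ sketch of that check. The helper names are hypothetical and are not part of the original kernels, and the check only covers the computed values, not what NVRTC actually emits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers mirroring kernel_A (division and modulo).
std::uint8_t warp_id_div(std::uint16_t idx) { return static_cast<std::uint8_t>(idx / 32); }
std::uint8_t lane_id_mod(std::uint16_t idx) { return static_cast<std::uint8_t>(idx % 32); }

// Hypothetical helpers mirroring kernel_B (shift and mask).
std::uint8_t warp_id_shift(std::uint16_t idx) { return static_cast<std::uint8_t>(idx >> 5); }
std::uint8_t lane_id_mask(std::uint16_t idx) { return static_cast<std::uint8_t>(idx & 31); }

// Returns true if both forms agree for every thread index in [0, block_size).
bool forms_agree(std::uint16_t block_size) {
    for (std::uint16_t idx = 0; idx < block_size; ++idx) {
        if (warp_id_div(idx) != warp_id_shift(idx)) return false;
        if (lane_id_mod(idx) != lane_id_mask(idx)) return false;
    }
    return true;
}

// Example usage: check the 1024-thread block size used in the measurements.
void check_default_block() { assert(forms_agree(1024)); }
```

Note that the `unsigned short` operand is promoted to `int` before the division, so `idx / 32` is formally signed arithmetic. Because the value is never negative, the results still match the shift-and-mask form.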

kernel_B simply runs faster. With all other code in the kernel omitted, and a launch of 1024 blocks of 1024 threads each, nvprof reports an average of 15.2 µs for kernel_A versus 7.4 µs for kernel_B. I suspect NVRTC did not optimize out the integer division and modulo.

The results were obtained on a GeForce 750 Ti with CUDA 8.0, averaged over 100 calls. The compiler option passed to nvrtcCompileProgram() was -arch compute_50.

Is this expected?
