Interpreting the verbose output of ptxas, part I
Question
I am trying to understand the resource usage of each of my CUDA threads for a hand-written kernel.
I compiled my kernel.cu file to a kernel.o file with nvcc -arch=sm_20 -ptxas-options=-v and got the following output (passed through c++filt):
ptxas info : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
ptxas info : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]
Looking at the output above, is it correct to say that
- each CUDA thread is using 46 registers?
- there is no register spilling to local memory?
I am also having some trouble understanding the rest of the output.
- My kernel calls a whole lot of __device__ functions. Is 72 bytes the sum total of the memory for the stack frames of the __global__ and __device__ functions?
- What is the difference between 0 bytes spill stores and 0 bytes spill loads?
- Why is the information for cmem (which I am assuming is constant memory) repeated twice with different figures? Within the kernel I am not using any constant memory. Does that mean the compiler is, under the hood, going to tell the GPU to use some constant memory?
This question is continued in: Interpreting the verbose output of ptxas, part II
Recommended answer
- Each CUDA thread is using 46 registers? Yes, correct.
- There is no register spilling to local memory? Yes, correct.
- Is 72 bytes the sum total of the memory for the stack frames of the __global__ and __device__ functions? Yes, correct.
- What is the difference between 0 bytes spill stores and 0 bytes spill loads? Fair question. The loads could be greater than the stores, since you could spill a computed value, load it once, discard it (i.e. store something else into that register), then load it again (i.e. reuse it). Update: note also that the spill load/store counts are based on static analysis, as described by @njuffa in the comments below.
- Constant memory is used for a few purposes, including __constant__ variables and kernel arguments, and different "banks" are used. That starts to get a bit detailed, but as long as you use less than 64KB for your __constant__ variables and less than 4KB for kernel arguments you will be OK.
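The figures ptxas prints at compile time can also be cross-checked at run time with cudaFuncGetAttributes. The following is only a minimal sketch: demo_kernel and the scale array are hypothetical stand-ins (not the searchkernel from the question), and the numbers reported depend on the target architecture and toolkit version. As far as I understand it, localSizeBytes is per-thread local memory, which would cover the stack frame and any spills, while constSizeBytes only counts user-declared __constant__ data, not the bank that holds kernel arguments.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical user-declared constant memory (4 floats = 16 bytes).
__constant__ float scale[4];

// Hypothetical stand-in kernel, used only to have something to query.
// Its arguments (in, out, n) are passed via constant memory by the
// compiler, even though the kernel body never declares any itself.
__global__ void demo_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * scale[i % 4];
}

int main()
{
    cudaFuncAttributes attr;

    // Query the compiled kernel's per-thread resource usage.
    cudaError_t err = cudaFuncGetAttributes(&attr, demo_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread    : %d\n", attr.numRegs);
    printf("local memory per thread : %zu bytes\n", attr.localSizeBytes);
    printf("user constant memory    : %zu bytes\n", attr.constSizeBytes);
    printf("static shared memory    : %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```

Building this with the same -ptxas-options=-v flag should let you compare the compile-time report with the values returned at run time. Note that recent toolkits no longer accept -arch=sm_20, so a newer architecture would have to be substituted.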