Interpreting the verbose output of ptxas, part I


Question

I am trying to understand resource usage for each of my CUDA threads for a hand-written kernel.

I compiled my kernel.cu file to a kernel.o file with nvcc -arch=sm_20 -ptxas-options=-v and got the following output (passed through c++filt):

ptxas info    : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
ptxas info    : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]

Looking at the output above, is it right to say that

  • each CUDA thread is using 46 registers?
  • there is no register spilling to local memory?

I am also having some issues with understanding the output.

  • My kernel is calling a whole lot of __device__ functions. Is 72 bytes the sum-total of the memory for the stack frames of the __global__ and __device__ functions?

  • What is the difference between 0 byte spill stores and 0 bytes spill loads?

  • Why is the information for cmem (which I am assuming is constant memory) repeated twice with different figures? Within the kernel I am not using any constant memory. Does that mean the compiler is, under the hood, going to tell the GPU to use some constant memory?

This question is continued in: Interpreting the verbose output of ptxas, part II

Answer

  • Each CUDA thread is using 46 registers? Yes, correct.
  • There is no register spilling to local memory? Yes, correct.
  • Is 72 bytes the sum-total of the memory for the stack frames of the __global__ and __device__ functions? Yes, correct.
  • What is the difference between 0 byte spill stores and 0 bytes spill loads?
    • Fair question. The loads could be greater than the stores, since you could spill a computed value, load it once, discard it (i.e. store something else into that register), then load it again (i.e. reuse it). Update: note also that the spill load/store count is based on static analysis, as described by @njuffa in the comments below.
  • Why is the information for cmem (which I am assuming is constant memory) repeated twice with different figures?
    • Constant memory is used for a few purposes, including __constant__ variables and kernel arguments. Different "banks" are used for these. That starts to get a bit detailed, but as long as you use less than 64KB for your __constant__ variables and less than 4KB for kernel arguments you will be OK.
