NSIGHT: What are those Red and Black colour in kernel-level experiments?


Question



I am trying to learn NSIGHT.

Can someone tell me what these red marks indicate in the following screenshot taken from the User Guide? There are two red marks in the Occupancy Per SM section and two in the Warps section, as you can see.

Similarly, what are those black lines of varying length indicating?

Another example from same page:

Solution

Here is the basic explanation:

  • Grey bars represent the available amount of resources your particular device has (due to both its hardware and its compute capability).
  • Black bars represent the theoretical limit that it is possible to achieve for your kernel under your launch configuration (blocks per grid and threads per block).
  • The red dots represent the resources that you are actually using.

For instance, looking at "Active warps" on the first picture:

  • Grey: The device supports 64 active warps concurrently.
  • Black: Given your register usage, it is theoretically possible to map 64 warps.
  • Red: You achieve 63.56 active warps.

In this case, the grey bar is hidden behind the black one, so you can't see it.

In some cases, it can happen that the theoretical limit is greater than the device limit. This is OK. You can see examples in the second picture (block limit (shared memory) and block limit (registers)). That makes sense if you consider that your kernel uses only a small fraction of your resources: if one block used only 1 register, it would be possible to launch 65536 blocks (without taking other factors into account), but your device limit is still 16. The number 128 comes from 65536/512. The same applies to the shared memory section: since you use 0 bytes of shared memory per block, you could launch an infinite number of blocks as far as the shared memory limit is concerned.

About blank spaces

The theoretical and the achieved values are the same for all rows except "Active warps" and "Occupancy". You really are executing 1024 threads per block with 32 warps per block in the first picture. In the case of Occupancy and Active warps, I guess the achieved number is a kind of statistical measure. I think that is because of the nature of the CUDA model. In CUDA, each thread within a warp is executed simultaneously on an SM. The way of hiding high-latency operations, such as memory reads, is through almost-free warp context switches. I guess it would be difficult to take an exact measure of the number of active warps in that situation. Besides hardware concepts, we also have to take the kernel implementation into account: branch divergence, for instance, could make one warp run slower than others, etc.

Extended information

As you saw, these numbers are closely related to your device specific hardware and compute capability, so perhaps a concrete example could help here:

A device with CCC 3.0 can handle a maximum of 2048 threads per SM, 16 blocks per SM and 64 warps per SM. You also have a maximum number of registers available to use (65536 in that case).

This Wikipedia entry is a handy site for being aware of each CCC's features.

You can query these parameters using the deviceQuery utility sample code provided with the CUDA toolkit or, at execution time, using the CUDA API as here.
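As a sketch of the runtime-API route (this requires a CUDA-capable machine to run; `cudaGetDeviceProperties` is the same call the deviceQuery sample is built on, and the `cudaDeviceProp` fields below are the ones matching the limits discussed above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0
    if (err != cudaSuccess) {
        std::fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("compute capability:   %d.%d\n", prop.major, prop.minor);
    std::printf("max threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("registers per SM:     %d\n", prop.regsPerMultiprocessor);
    std::printf("shared mem per SM:    %zu bytes\n",
                prop.sharedMemPerMultiprocessor);
    std::printf("warp size:            %d\n", prop.warpSize);
    return 0;
}
```

On a CCC 3.0 device these should report the 2048-threads, 65536-registers, 32-wide-warp figures quoted above.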

Performance considerations

The thing is that, ideally, 16 blocks of 128 threads could be executed using fewer than 32 registers per thread. That means a high occupancy rate. In most cases your kernel needs more than 32 registers per thread, so it is no longer possible to execute 16 blocks concurrently on the SM; the reduction is then done at block-level granularity, i.e., by decreasing the number of blocks. And this is what the bars capture.

You can play with the number of threads and blocks, or even with the __launch_bounds__ directive, to optimize your kernel, or you can use the --maxrregcount setting to lower the number of registers used by a single kernel to see if it improves overall execution speed.
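A minimal sketch of the __launch_bounds__ route (the kernel name and body are made up for illustration; the bounds of 128 threads per block and 16 blocks per SM match the example above, which on a CCC 3.0 device works out to 65536 / (128 × 16) = 32 registers per thread):

```cuda
// Hypothetical kernel: the qualifier promises the compiler that this kernel
// will be launched with at most 128 threads per block, and asks for at least
// 16 resident blocks per SM, so the compiler tries to keep register usage
// within the budget that makes that occupancy possible.
__global__ void __launch_bounds__(128, 16)
scale_kernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;  // placeholder work
}
```

Alternatively, compiling with `nvcc --maxrregcount=32` applies the same per-thread register cap to every kernel in the file rather than to one kernel.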
