CUDA: Nsight VS2010 profile __device__ function


Problem description

I would like to know how to profile a __device__ function that is called inside a __global__ function, using Nsight 2.2 on Visual Studio 2010. I need to know which function is consuming a lot of resources and time. I have CUDA 5.0 on a CC 2.0 device.

Solution

Nsight Visual Studio Edition 3.0 CUDA Profiler introduces source-correlated experiments. The Profile CUDA Activity supports the following source-level experiments:

  • Instruction Count - Collects instructions executed, thread instructions executed, active thread histogram, predicated thread histogram for every user instruction in the kernel. Information on syscalls (printf) is not collected.

  • Divergent Branch - Collects branch taken, branch not taken, and divergence count for flow control instructions.

  • Memory Transactions - Collects transaction counts, ideal transaction counter, and requested bytes for global, local, and shared memory instructions.

This information is collected per SASS instruction. If the kernel is compiled with -lineinfo (--generate-line-info), the information can be rolled up to PTX and high-level source code. Because this data is rolled up from SASS, some statistics may not be intuitive at the level of the high-level source. For example, a branch statistic may show 100% not taken when you expected 100% taken. If you look at the SASS code, you may see that the compiler reversed the conditional.
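
As a minimal sketch of what such a setup could look like, the program below has a `__device__` function called from a `__global__` kernel. The file name `profile_demo.cu`, the kernel `scale_kernel`, and the device function `scale` are hypothetical and not from the original post. Building with `-lineinfo` is what allows the per-SASS counts to be attributed back to the source lines of `scale`.

```cuda
// Build (assumed command line; -lineinfo keeps source line information
// for device code):
//   nvcc -lineinfo -o profile_demo profile_demo.cu
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical device function to inspect in the source-level views.
__device__ float scale(float x)
{
    // Data-dependent branch: the generated SASS may test the inverted
    // condition, so taken/not-taken counts can look reversed.
    if (x > 0.0f)
        return 2.0f * x;
    return -x;
}

__global__ void scale_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = scale(in[i]);
}

int main()
{
    const int n = 256;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = (float)(i - n / 2);   // half negative, half non-negative

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scale_kernel<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[0] = %f, out[%d] = %f\n", h_out[0], n - 1, h_out[n - 1]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The compiler will usually inline a small function like `scale` into the kernel, and it may replace the branch with predicated or select instructions, so the counts shown against these source lines depend on the SASS that is actually generated.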

Please also note that on optimized builds the compiler is sometimes unable to maintain line table information.

At this time, hardware performance counters and timing are only available at the kernel level.

Device code timing can be done using clock() and clock64(), as mentioned in the comments. This is a very advanced technique that requires the ability both to understand SASS and to interpret the results with respect to the SM warp schedulers.
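
A rough sketch of the clock64() approach, under stated assumptions, might look like the following. The device function `work`, the kernel `timed_kernel`, and the `cycles` output buffer are hypothetical names introduced for illustration. Each thread reads the SM cycle counter before and after the call and stores the difference.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical device function whose cost we want to estimate.
__device__ float work(float x)
{
    for (int i = 0; i < 100; ++i)
        x = x * 1.0001f + 0.5f;
    return x;
}

__global__ void timed_kernel(float *out, long long *cycles)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    long long start = clock64();   // per-SM cycle counter before the call
    float r = work((float)tid);
    long long stop = clock64();    // per-SM cycle counter after the call

    out[tid] = r;                  // store the result so the work is not removed
    cycles[tid] = stop - start;    // elapsed cycles as observed by this thread
}

int main()
{
    const int n = 64;
    float *d_out = 0;
    long long *d_cycles = 0;
    long long h_cycles[n];

    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_cycles, n * sizeof(long long));

    timed_kernel<<<1, n>>>(d_out, d_cycles);
    cudaError_t err = cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("thread 0: %lld cycles, thread %d: %lld cycles\n",
           h_cycles[0], n - 1, h_cycles[n - 1]);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```

The measured difference includes cycles in which the warp was stalled or other warps were scheduled, and the compiler is free to move instructions across the two counter reads. This is why the SASS has to be checked before the numbers are trusted.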
