CUDA: Nsight VS2010 profile __device__ function

Question

I would like to know how to profile a __device__ function that is called from inside a __global__ function, using Nsight 2.2 on Visual Studio 2010. I need to know which function is consuming the most resources and time. I have CUDA 5.0 on a compute capability (CC) 2.0 device.

Recommended answer

Nsight Visual Studio Edition 3.0 CUDA Profiler introduces source-correlated experiments. The Profile CUDA Activity supports the following source-level experiments:

  • Instruction Count - Collects instructions executed, thread instructions executed, active thread histogram, and predicated thread histogram for every user instruction in the kernel. Information on syscalls (printf) is not collected.

  • Divergent Branch - Collects branch taken, branch not taken, and divergence count for flow control instructions.

  • Memory Transactions - Collects transaction counts, ideal transaction counts, and requested bytes for global, local, and shared memory instructions.

This information is collected per SASS instruction. If the kernel is compiled with -lineinfo (--generate-line-info), the information can be rolled up to PTX and high-level source code. Because the data is rolled up from SASS, some statistics may not look intuitive at the source level. For example, a branch statistic may show 100% not taken when you expected 100% taken. If you look at the SASS code, you may see that the compiler reversed the conditional.
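
As an illustration of the kind of code these experiments apply to, here is a minimal sketch of a kernel that calls a __device__ helper containing one data-dependent branch. The names scale_if_positive and apply and the file name app.cu are hypothetical and not from the original answer. The helper will typically be inlined into the kernel, and where line information survives, its instructions should still map back to the helper's own source lines.

```cuda
// Build with line information (assumed file name app.cu):
//   nvcc -lineinfo -o app app.cu
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper; __host__ is added only so the host can check results.
__host__ __device__ float scale_if_positive(float x)
{
    // The compiler may emit the inverse condition (or a predicated
    // select) in SASS, so per-line branch counts can look "swapped".
    if (x > 0.0f) {
        return 2.0f * x;
    }
    return x;
}

__global__ void apply(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = scale_if_positive(in[i]);
    }
}

int main()
{
    const int n = 1024;
    static float host_in[n], host_out[n];
    // Alternate signs so neighbouring threads in a warp disagree on the branch.
    for (int i = 0; i < n; ++i) host_in[i] = (i % 2 == 0) ? 1.0f : -1.0f;

    float* dev_in = 0;
    float* dev_out = 0;
    cudaMalloc(&dev_in, n * sizeof(float));
    cudaMalloc(&dev_out, n * sizeof(float));
    cudaMemcpy(dev_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    apply<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);
    cudaMemcpy(host_out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against the same function evaluated on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (host_out[i] != scale_if_positive(host_in[i])) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(dev_in);
    cudaFree(dev_out);
    return mismatches == 0 ? 0 : 1;
}
```

Profiling a build like this with the Divergent Branch experiment is where the taken and not-taken counts for the `if` line may differ from what the source suggests, which is the point to cross-check against SASS.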

Please also note that on optimized builds the compiler is sometimes unable to maintain line table information.

At this time, hardware performance counters and timing are only available at the kernel level.

Device code timing can be done using clock() and clock64(), as mentioned in the comments. This is a very advanced technique that requires the ability both to understand SASS and to interpret the results with respect to the SM warp schedulers.
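
A minimal sketch of the clock64() approach, assuming a hypothetical __device__ function named work and a kernel named timed_kernel (neither is from the original answer). Each thread reads the cycle counter before and after the call and stores the difference. The numbers are only a rough guide: the compiler may move instructions across the clock reads, and the elapsed cycles include time the warp spent waiting while other warps were scheduled, which is why the SASS and the warp schedulers have to be taken into account.

```cuda
// Build (assumed file name timing.cu; clock64() needs CC 2.0 or later,
// so pick an -arch value that matches your GPU and toolkit):
//   nvcc -arch=sm_20 -o timing timing.cu
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical __device__ function whose cost we want to estimate.
__device__ float work(float x)
{
    for (int k = 0; k < 64; ++k) {
        x = x * 1.0001f + 0.5f;
    }
    return x;
}

__global__ void timed_kernel(const float* in, float* out,
                             long long* cycles, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    long long start = clock64();   // per-multiprocessor cycle counter
    float r = work(in[i]);
    long long stop = clock64();

    out[i] = r;                    // storing r keeps the call from being removed
    cycles[i] = stop - start;      // cycles elapsed as seen by this thread
}

// Host helper: arithmetic mean of the per-thread cycle counts.
double mean_cycles(const long long* v, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += static_cast<double>(v[i]);
    return n > 0 ? sum / n : 0.0;
}

int main()
{
    const int n = 256;
    static float host_in[n];
    static long long host_cycles[n];
    for (int i = 0; i < n; ++i) host_in[i] = 1.0f;

    float* dev_in = 0;
    float* dev_out = 0;
    long long* dev_cycles = 0;
    cudaMalloc(&dev_in, n * sizeof(float));
    cudaMalloc(&dev_out, n * sizeof(float));
    cudaMalloc(&dev_cycles, n * sizeof(long long));
    cudaMemcpy(dev_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    timed_kernel<<<1, n>>>(dev_in, dev_out, dev_cycles, n);
    cudaMemcpy(host_cycles, dev_cycles, n * sizeof(long long),
               cudaMemcpyDeviceToHost);

    printf("mean cycles per thread: %.1f\n", mean_cycles(host_cycles, n));

    cudaFree(dev_in);
    cudaFree(dev_out);
    cudaFree(dev_cycles);
    return 0;
}
```

Comparing the mean for different candidate functions, measured under the same launch configuration, gives a relative ranking rather than an absolute cost.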
