Calculating gst_throughput and gld_throughput with nvprof


Problem Description



I have the following problem. I want to measure the gst_efficiency and the gld_efficiency for my CUDA application using nvprof. The documentation distributed with CUDA 5.0 tells me to generate these using the following formulas for devices with compute capability 2.0-3.0:

gld_efficiency = 100 * gld_requested_throughput / gld_throughput

gst_efficiency = 100 * gst_requested_throughput / gst_throughput

For the required metrics the following formulas are given:

gld_throughput = ((128 * global_load_hit) + (l2_subp0_read_requests + l2_subp1_read_requests) * 32 - (l1_local_ld_miss * 128)) / gputime

gst_throughput = ((l2_subp0_write_requests + l2_subp1_write_requests) * 32 - (l1_local_ld_miss * 128)) / gputime

gld_requested_throughput = (gld_inst_8bit + 2 * gld_inst_16bit + 4 * gld_inst_32bit + 8
* gld_inst_64bit + 16 * gld_inst_128bit) / gputime

gst_requested_throughput = (gst_inst_8bit + 2 * gst_inst_16bit + 4 * gst_inst_32bit + 8
* gst_inst_64bit + 16 * gst_inst_128bit) / gputime
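
Note that gputime cancels out of the two efficiency ratios, so only the raw counter values matter for gld_efficiency and gst_efficiency. A minimal host-side C++ sketch of the arithmetic (the struct fields simply mirror the variable names from the formulas as quoted; this is not nvprof's own implementation, and the event-name caveats from the answer below still apply):

// Raw counter values as collected from the profiler for one kernel launch.
struct LoadStoreCounters {
    double global_load_hit;
    double l2_subp0_read_requests, l2_subp1_read_requests;
    double l2_subp0_write_requests, l2_subp1_write_requests;
    double l1_local_ld_miss;
    double gld_inst_8bit, gld_inst_16bit, gld_inst_32bit, gld_inst_64bit, gld_inst_128bit;
    double gst_inst_8bit, gst_inst_16bit, gst_inst_32bit, gst_inst_64bit, gst_inst_128bit;
};

// 100 * gld_requested_throughput / gld_throughput; gputime cancels out of the ratio.
double gld_efficiency(const LoadStoreCounters &c) {
    double actual_bytes = 128.0 * c.global_load_hit
                        + (c.l2_subp0_read_requests + c.l2_subp1_read_requests) * 32.0
                        - c.l1_local_ld_miss * 128.0;
    double requested_bytes = c.gld_inst_8bit + 2 * c.gld_inst_16bit + 4 * c.gld_inst_32bit
                           + 8 * c.gld_inst_64bit + 16 * c.gld_inst_128bit;
    return 100.0 * requested_bytes / actual_bytes;
}

// 100 * gst_requested_throughput / gst_throughput; gputime cancels out of the ratio.
double gst_efficiency(const LoadStoreCounters &c) {
    double actual_bytes = (c.l2_subp0_write_requests + c.l2_subp1_write_requests) * 32.0
                        - c.l1_local_ld_miss * 128.0;
    double requested_bytes = c.gst_inst_8bit + 2 * c.gst_inst_16bit + 4 * c.gst_inst_32bit
                           + 8 * c.gst_inst_64bit + 16 * c.gst_inst_128bit;
    return 100.0 * requested_bytes / actual_bytes;
}

Because gputime cancels, timing noise between separate profiling runs only affects the raw throughput values, not the efficiency percentages.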

Since for the metrics used no formula is given I assume that these are events which can be counted by nvprof. But some of the events seem not to be available on my gtx 460 (also tried gtx 560 Ti). I pasted the output of nvprof --query-events

Any ideas what's going wrong or what I'm misinterpreting?

EDIT: I don't want to use the CUDA Visual Profiler, since I'm trying to analyse my application for different parameters. I therefore want to run nvprof with multiple parameter configurations, recording multiple events (each one in its own run) and then output the data in tables. I have this automated already and working for other metrics (e.g. instructions issued) and want to do the same for load and store efficiency. This is why I'm not interested in solutions involving nvvp. By the way, for my application nvvp fails to calculate the metrics required for store efficiency, so it doesn't help me at all in this case.
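
For reference, a single run in such a sweep boils down to asking nvprof for one event (or a small compatible group of events) and logging the result, roughly along these lines (the event name, log file and application arguments are placeholders):

nvprof --events l1_global_load_hit --csv --log-file run_cfg1.csv ./myapp <arguments for configuration 1>

The per-run CSV files can then be merged into the final table.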

Solution

I'm glad somebody had the same issue :) I was trying to do the very same thing and couldn't use the Visual Profiler, because I wanted to profile like 6000 different kernels.

The formulas on the NVidia site are poorly documented - actually the variables can be:

a) events

b) other metrics

c) different variables dependent on the GPU you have

However, a LOT of the metrics there either have typos in them or are named a bit differently in nvprof than on the site. Also, the variables are not tagged, so you can't tell just by looking whether they are a), b) or c). I used a script to grep them and then had to fix them by hand. Here is what I found (the throughput formulas rewritten with the nvprof spellings follow the list):

1) "l1_local/global_ld/st_hit/miss" These have "load"/"store" in nvprof instead of "ld"/"st" on site.

2) "l2_ ...whatever... _requests" These have "sector_queries" in nvprof instead of "requests".

3) "local_load/store_hit/miss" These have "l1_" in additionally in the profiler - "l1_local/global_load/store_hit/miss"

4) "tex0_cache_misses" This one has "sector" in it in the profiler - "tex0_cache_sector_misses"

5) "tex_cache_sector_queries" Missing "0" - so "tex0_cache_sector_queries" in the nvprof.

Finally, the variables:

1) "#SM" The number of streaming multiprocessors. Get via cudaDeviceProp.

2) "gputime" Obviously, the execution time on GPU.

3) "warp_size" The size of warp on your GPU, again get via cudaDeviceProp.

4) "max_warps_per_sm" Number of blocks executable on an sm * #SM * warps per block. I guess.

5) "elapsed_cycles" Found this: https://devtalk.nvidia.com/default/topic/518827/computeprof-34-active-cycles-34-counter-34-active-cycles-34-value-doesn-39-t-make-sense-to-/ But still not entirely sure, if I get it.

Hopefully this helps you and some other people who encounter the same problem :)
