nvprof option for bandwidth


Problem Description



What is the correct option for measuring bandwidth using nvprof --metrics from the command line? I am using flop_dp_efficiency to get the percentage of peak FLOPS, but there seem to be many options for bandwidth measurement in the manual and I don't really understand what I would be measuring. For example, dram_read, dram_write, gld_read, gld_write all look the same to me. Also, should I report bandwidth as the sum of read and write throughput, assuming both happen simultaneously?
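For reference, the kind of invocation described here might look like the following, where ./app is just a placeholder for the application binary:

    nvprof --metrics flop_dp_efficiency ./app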

Edit:

Based on the excellent answer with the diagram, what would be the bandwidth going from the device memory to the kernel? I am thinking of taking the minimum bandwidth (read+write) along the path from the kernel to the device memory, which is probably the dram to L2 cache link.

I am trying to determine whether a kernel is compute-bound or memory-bound by measuring FLOPS and bandwidth.

Solution

In order to understand the profiler metrics in this area, it's necessary to have an understanding of the memory model in a GPU. I find the diagram published in the Nsight Visual Studio edition documentation to be useful. I have marked up the diagram with numbered arrows which refer to the numbered metrics (and direction of transfer) I have listed below:

Please refer to the CUDA profiler metrics reference for a description of each metric:

  1. dram_read_throughput, dram_read_transactions
  2. dram_write_throughput, dram_write_transactions
  3. sysmem_read_throughput, sysmem_read_transactions
  4. sysmem_write_throughput, sysmem_write_transactions
  5. l2_l1_read_transactions, l2_l1_read_throughput
  6. l2_l1_write_transactions, l2_l1_write_throughput
  7. l2_tex_read_transactions, l2_texture_read_throughput
  8. texture is read-only, there are no transactions possible on this path
  9. shared_load_throughput, shared_load_transactions
  10. shared_store_throughput, shared_store_transactions
  11. l1_cache_local_hit_rate
  12. l1 is write-through cache, so there are no (independent) metrics for this path - refer to other local metrics
  13. l1_cache_global_hit_rate
  14. see note on 12
  15. gld_efficiency, gld_throughput, gld_transactions
  16. gst_efficiency, gst_throughput, gst_transactions

Notes:

  1. An arrow from right to left indicates read activity. An arrow from left to right indicates write activity.
  2. "global" is a logical space. It refers to a logical address space from the programmer's point of view. Transactions directed to the "global" space could end up in one of the caches, in sysmem, or in device memory (dram). "dram", on the other hand, is a physical entity (as are the L1 and L2 caches, for example). The "logical spaces" are all depicted in the first column of the diagram, immediately to the right of the "kernel" column. The remaining columns to the right are physical entities or resources.
  3. I have not tried to mark every possible memory metric with a location on the chart. Hopefully this chart will be instructive if you need to figure out the others.
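As an illustration of note 2, the following is a minimal sketch (the kernel and problem size are arbitrary, chosen only for illustration) of a kernel whose global loads and stores are counted by the logical gld_*/gst_* metrics, while only the part of that traffic that actually reaches device memory shows up in the physical dram_* metrics:

    // bw_demo.cu - trivial copy kernel to profile with nvprof metrics
    #include <cuda_runtime.h>

    // Each thread performs one global load and one global store. These are
    // accesses to the "global" (logical) space; whether they are serviced by
    // the caches or by dram is what the physical dram_* metrics report.
    __global__ void copy_kernel(const double *in, double *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    int main()
    {
        const int n = 1 << 24;                      // arbitrary problem size
        double *in = 0, *out = 0;
        cudaMalloc(&in,  n * sizeof(double));
        cudaMalloc(&out, n * sizeof(double));
        copy_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }

Profiling this with gld_transactions and dram_read_transactions, for example, should show the distinction: the former counts the kernel's global load transactions regardless of where they are serviced, the latter only the reads that actually go to device memory.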

With the above description, it's possible your question still may not be answered; you would then need to clarify what exactly you want to measure. However, based on your question as written, you probably want to look at the dram_xxx metrics if what you care about is actual consumed memory bandwidth.
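As a sketch of what that might look like on the command line (./app is again just a placeholder for the application binary):

    nvprof --metrics dram_read_throughput,dram_write_throughput,dram_read_transactions,dram_write_transactions ./app

If a single consumed-bandwidth figure is wanted, summing the reported read and write throughputs (as the question proposes) is one reasonable way to report it.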

Also, if you are simply trying to get an estimate of the maximum available memory bandwidth, using the CUDA sample code bandwidthTest is probably the easiest way to get a proxy measurement. Just use the reported device-to-device bandwidth number as an estimate of the maximum memory bandwidth available to your code.
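For example (the exact location of the compiled sample depends on your CUDA installation, so treat this as a sketch):

    ./bandwidthTest
    # use the number reported under "Device to Device Bandwidth" as the peak estimate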

Combining the above ideas, the dram_utilization metric gives a scaled result that represents the portion (from 0 to 10) of the total available memory bandwidth that was actually used.
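Putting these pieces together, one rough, back-of-the-envelope way to make the compute-bound vs. memory-bound call from the question (a heuristic comparison, not something the profiler reports directly) is:

    flops_fraction = flop_dp_efficiency                      // fraction of peak FLOPS, from nvprof
    bw_fraction    = (dram_read_throughput + dram_write_throughput) / peak_bw
                     // peak_bw = device-to-device number from bandwidthTest

    flops_fraction > bw_fraction  ->  the kernel looks closer to compute-limited
    flops_fraction < bw_fraction  ->  the kernel looks closer to memory-limited

Note that dram_utilization conveys roughly the same information as bw_fraction, expressed on a 0 to 10 scale.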
