nvprof option for bandwidth


Problem Description

What is the correct option for measuring bandwidth with nvprof --metrics from the command line? I am using flop_dp_efficiency to get the percentage of peak FLOPS, but the manual seems to offer many bandwidth-related options, and I don't really understand what each one measures. For example, dram_read, dram_write, gld_read, and gld_write all look the same to me. Also, should I report bandwidth as the sum of read and write throughput, on the assumption that both happen simultaneously?
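
For illustration, the sort of invocation described above might look like the following, where ./my_application is only a placeholder for the actual executable:

    nvprof --metrics flop_dp_efficiency ./my_application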

Edit:

Based on the excellent answer with the diagram, what would be the bandwidth going from the device memory to the kernel? I am thinking of taking the minimum bandwidth (read + write) on the path from the kernel to the device memory, which is probably DRAM to the L2 cache.

I am trying to determine whether a kernel is compute-bound or memory-bound by measuring FLOPS and bandwidth.

Solution

In order to understand the profiler metrics in this area, it's necessary to understand the memory model of a GPU. I find the diagram published in the Nsight Visual Studio Edition documentation useful. I have marked up that diagram with numbered arrows that refer to the numbered metrics (and directions of transfer) listed below:

Please refer to the CUDA profiler metrics reference for a description of each metric:

  1. dram_read_throughput, dram_read_transactions
  2. dram_write_throughput, dram_write_transactions
  3. sysmem_read_throughput, sysmem_read_transactions
  4. sysmem_write_throughput, sysmem_write_transactions
  5. l2_l1_read_transactions, l2_l1_read_throughput
  6. l2_l1_write_transactions, l2_l1_write_throughput
  7. l2_tex_read_transactions, l2_texture_read_throughput
  8. texture is read-only, there are no transactions possible on this path
  9. shared_load_throughput, shared_load_transactions
  10. shared_store_throughput, shared_store_transactions
  11. l1_cache_local_hit_rate
  12. l1 is write-through cache, so there are no (independent) metrics for this path - refer to other local metrics
  13. l1_cache_global_hit_rate
  14. see note on 12
  15. gld_efficiency, gld_throughput, gld_transactions
  16. gst_efficiency, gst_throughput, gst_transactions

Notes:

  1. An arrow from right to left indicates read activity. An arrow from left to right indicates write activity.
  2. "global" is a logical space. It refers to a logical address space from the programmers point of view. Transactions directed to the "global" space could end up in one of the caches, in sysmem, or in device memory (dram). "dram", on the other hand, is a physical entity (as is the L1 and L2 caches, for example). The "logical spaces" are all depicted in the first column of the diagram immediately to the right of the "kernel" column. The remaining columns to the right are physical entities or resources.
  3. I have not tried to mark every possible memory metric with a location on the chart. Hopefully this chart will be instructive if you need to figure out the others.
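
To check which of the metrics listed above are supported by a particular device and nvprof version, the available metrics (with one-line descriptions) can be listed directly, no application required:

    nvprof --query-metrics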

With the above description, it's possible your question still may not be answered; in that case you would need to clarify your request -- what exactly do you want to measure? However, based on your question as written, if what you care about is the memory bandwidth that is actually consumed, you probably want to look at the dram_xxx metrics.
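
As a concrete sketch of collecting those DRAM metrics for a run (the executable name is again a placeholder):

    nvprof --metrics dram_read_throughput,dram_write_throughput,dram_read_transactions,dram_write_transactions ./my_application

--metrics accepts a comma-separated list, so the read and write sides can be captured in a single profiling pass.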

Also, if you are simply trying to get an estimate of the maximum available memory bandwidth, running the CUDA sample code bandwidthTest is probably the easiest way to get a proxy measurement. Just use the reported device-to-device bandwidth number as an estimate of the maximum memory bandwidth available to your code.
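
Once the sample is built, a plain run is enough for this purpose (the path to the binary depends on where the CUDA samples were compiled):

    ./bandwidthTest

The number of interest is the one reported under "Device to Device Bandwidth".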

Combining the above ideas, the dram_utilization metric gives a scaled result that represents the portion (from 0 to 10) of the total available memory bandwidth that was actually used.
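
Tying this back to the compute-bound versus memory-bound question, one rough approach (a sketch, not a definitive recipe; the executable name is a placeholder) is to collect the FLOP efficiency and the DRAM utilization in the same run and see which resource is closer to its limit:

    nvprof --metrics flop_dp_efficiency,dram_utilization ./my_application

If flop_dp_efficiency is close to 100% while dram_utilization is low, the kernel is likely compute-bound; the opposite pattern suggests it is memory-bound.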
