nvprof option for bandwidth


Problem Description



What is the correct option for measuring bandwidth using nvprof --metrics from the command line? I am using flop_dp_efficiency to get the percentage of peak FLOPS, but there seem to be many options for bandwidth measurement in the manual and I don't really understand what I would be measuring. For example, dram_read, dram_write, gld_read, gld_write all look the same to me. Also, should I report bandwidth as the sum of read and write throughput, assuming both happen simultaneously?
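For reference, the kind of invocation described here might look like the following, where ./app is just a placeholder for the application binary:

    nvprof --metrics flop_dp_efficiency ./app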

Edit:

Based on the excellent answer with the diagram, what would be the bandwidth going from the device memory to the kernel? I am thinking of taking the minimum bandwidth (read+write) along the path from the kernel to the device memory, which is probably the dram to L2 cache link.

I am trying to determine whether a kernel is compute-bound or memory-bound by measuring FLOPS and bandwidth.

Solution

In order to understand the profiler metrics in this area, it's necessary to have an understanding of the memory model in a GPU. I find the diagram published in the Nsight Visual Studio edition documentation to be useful. I have marked up the diagram with numbered arrows which refer to the numbered metrics (and direction of transfer) I have listed below:

Please refer to the CUDA profiler metrics reference for a description of each metric:

  1. dram_read_throughput, dram_read_transactions
  2. dram_write_throughput, dram_write_transactions
  3. sysmem_read_throughput, sysmem_read_transactions
  4. sysmem_write_throughput, sysmem_write_transactions
  5. l2_l1_read_transactions, l2_l1_read_throughput
  6. l2_l1_write_transactions, l2_l1_write_throughput
  7. l2_tex_read_transactions, l2_texture_read_throughput
  8. texture is read-only, there are no transactions possible on this path
  9. shared_load_throughput, shared_load_transactions
  10. shared_store_throughput, shared_store_transactions
  11. l1_cache_local_hit_rate
  12. l1 is write-through cache, so there are no (independent) metrics for this path - refer to other local metrics
  13. l1_cache_global_hit_rate
  14. see note on 12
  15. gld_efficiency, gld_throughput, gld_transactions
  16. gst_efficiency, gst_throughput, gst_transactions

Notes:

  1. An arrow from right to left indicates read activity. An arrow from left to right indicates write activity.
  2. "global" is a logical space. It refers to a logical address space from the programmer's point of view. Transactions directed to the "global" space could end up in one of the caches, in sysmem, or in device memory (dram). "dram", on the other hand, is a physical entity (as are the L1 and L2 caches, for example). The "logical spaces" are all depicted in the first column of the diagram, immediately to the right of the "kernel" column. The remaining columns to the right are physical entities or resources.
  3. I have not tried to mark every possible memory metric with a location on the chart. Hopefully this chart will be instructive if you need to figure out the others.
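As an illustration of note 2, the following is a minimal sketch (the kernel and problem size are arbitrary, chosen only for illustration) of a kernel whose global loads and stores are counted by the logical gld_*/gst_* metrics, while only the part of that traffic that actually reaches device memory shows up in the physical dram_* metrics:

    // bw_demo.cu - trivial copy kernel to profile with nvprof metrics
    #include <cuda_runtime.h>

    // Each thread performs one global load and one global store. These are
    // accesses to the "global" (logical) space; whether they are serviced by
    // the caches or by dram is what the physical dram_* metrics report.
    __global__ void copy_kernel(const double *in, double *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    int main()
    {
        const int n = 1 << 24;                      // arbitrary problem size
        double *in = 0, *out = 0;
        cudaMalloc(&in,  n * sizeof(double));
        cudaMalloc(&out, n * sizeof(double));
        copy_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }

Profiling this with gld_transactions and dram_read_transactions, for example, should show the distinction: the former counts the kernel's global load transactions regardless of where they are serviced, the latter only the reads that actually go to device memory.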

With the above description, it's possible your question still may not be answered; you would then need to clarify what exactly you want to measure. However, based on your question as written, you probably want to look at the dram_xxx metrics if what you care about is actual consumed memory bandwidth.
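As a sketch of what that might look like on the command line (./app is again just a placeholder for the application binary):

    nvprof --metrics dram_read_throughput,dram_write_throughput,dram_read_transactions,dram_write_transactions ./app

If a single consumed-bandwidth figure is wanted, summing the reported read and write throughputs (as the question proposes) is one reasonable way to report it.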

Also, if you are simply trying to get an estimate of the maximum available memory bandwidth, using the CUDA sample code bandwidthTest is probably the easiest way to get a proxy measurement. Just use the reported device-to-device bandwidth number as an estimate of the maximum memory bandwidth available to your code.
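For example (the exact location of the compiled sample depends on your CUDA installation, so treat this as a sketch):

    ./bandwidthTest
    # use the number reported under "Device to Device Bandwidth" as the peak estimate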

Combining the above ideas, the dram_utilization metric gives a scaled result that represents the portion (from 0 to 10) of the total available memory bandwidth that was actually used.
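Putting these pieces together, one rough, back-of-the-envelope way to make the compute-bound vs. memory-bound call from the question (a heuristic comparison, not something the profiler reports directly) is:

    flops_fraction = flop_dp_efficiency                      // fraction of peak FLOPS, from nvprof
    bw_fraction    = (dram_read_throughput + dram_write_throughput) / peak_bw
                     // peak_bw = device-to-device number from bandwidthTest

    flops_fraction > bw_fraction  ->  the kernel looks closer to compute-limited
    flops_fraction < bw_fraction  ->  the kernel looks closer to memory-limited

Note that dram_utilization conveys roughly the same information as bw_fraction, expressed on a 0 to 10 scale.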
