CUDA zero-copy performance


Problem description

Does anyone have experience analyzing the performance of CUDA applications that use the zero-copy memory model (reference here: Default Pinned Memory Vs Zero-Copy Memory)?

I have a kernel that uses the zero-copy feature, and in NVVP I see the following:

Running the kernel on an average problem size, I get an instruction replay overhead of 0.7%, so nothing major, and all of that 0.7% is global memory replay overhead.

When I really jack up the problem size, I get an instruction replay overhead of 95.7%, all of which is due to global memory replay overhead.

However, the global load efficiency and global store efficiency are the same for both the normal-problem-size run and the very large-problem-size run. I'm not really sure what to make of this combination of metrics.

The main thing I'm not sure of is which statistics in NVVP will help me see what is going on with the zero-copy feature. Any ideas about what kind of statistics I should be looking at?
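
For context, here is a minimal sketch of the zero-copy (mapped pinned memory) setup being discussed; the kernel, sizes, and names are placeholders and not taken from the question:

#include <cuda_runtime.h>
#include <cstdio>

// Toy kernel: in the zero-copy case every load and store to data goes over PCIe.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h_ptr = nullptr, *d_ptr = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);                    // allow mapped pinned allocations
    cudaHostAlloc(&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_ptr[i] = 1.0f;
    cudaHostGetDevicePointer(&d_ptr, h_ptr, 0);               // device-visible alias, no cudaMemcpy

    scale<<<(n + 255) / 256, 256>>>(d_ptr, n);
    cudaDeviceSynchronize();

    printf("h_ptr[0] = %f\n", h_ptr[0]);                      // host reads the result directly
    cudaFreeHost(h_ptr);
    return 0;
}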

Recommended answer

Fermi and Kepler GPUs need to replay memory instructions for multiple reasons:


  1. The memory operation was for a size specifier (vector type) that requires multiple transactions in order to perform the address divergence calculation and communicate data to/from the L1 cache.
  2. The memory operation had thread address divergence requiring access to multiple cache lines.
  3. The memory transaction missed the L1 cache. When the miss value is returned to L1 the L1 notifies the warp scheduler to replay the instruction.
  4. The LSU unit resources are full and the instruction needs to be replayed when the resources are available.

The latency to

  • L2 is 200-400 cycles
  • device memory (DRAM) is 400-800 cycles
  • zero-copy memory over PCIe is thousands of cycles

The replay overhead increases because the higher latency drives up misses and contention for LSU resources.

The global load efficiency is not increasing as it is the ratio of the ideal amount of data that would need to be transferred for the memory instructions that were executed to the actual amount of data transferred. Ideal means that the executed threads accessed sequential elements in memory starting at a cache line boundary (32-bit operation is 1 cache line, 64-bit operation is 2 cache lines, 128-bit operation is 4 cache lines). Accessing zero copy is slower and less efficient but it does not increase or change the amount of data transferred.
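
As an illustration only (these kernels are not from the question), the two access patterns below are what this efficiency ratio distinguishes: the first reads sequential elements starting at a cache-line boundary (ideal), while the second spreads a warp's addresses over many cache lines, so more data is transferred than the instructions need. Moving either buffer to zero-copy memory slows the accesses but leaves the ratio unchanged.

// Assumed illustrative kernels, not taken from the question.
__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                    // a warp's 32-bit loads fall in one 128-byte cache line
}

__global__ void strided_copy(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i * stride] = in[i * stride];  // each thread touches a different cache line, so much of
                                           // the transferred data is never used
}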

The profiler exposes the following counters:


  • gld_throughput
  • l1_cache_global_hit_rate
  • dram_{read, write}_throughput
  • l2_l1_read_hit_rate

In the zero-copy case, all of these metrics should be much lower.
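
One way to collect these from the command line (an assumed workflow; the executable name is a placeholder, and the exact metric names available depend on your GPU and CUDA version) would be something like:

nvprof --metrics gld_throughput,l1_cache_global_hit_rate,dram_read_throughput,dram_write_throughput,l2_l1_read_hit_rate ./my_zero_copy_app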

The Nsight VSE CUDA Profiler memory experiments will show the amount of data accessed over PCIe (zero copy memory).

