How to explain Instruction replay in CUDA


Question

Could anyone summarize the definitions of and reasons for the different kinds of instruction replays in CUDA?

They are:


  1. inst_replay_overhead
  2. shared_replay_overhead
  3. global_replay_overhead
  4. global_cache_replay_overhead
  5. local_replay_overhead
  6. atomic_replay_overhead
  7. shared_load_replay
  8. shared_store_replay
  9. global_ld_mem_divergence_replays
  10. global_st_mem_divergence_replays


Solution

This answer applies to Compute Capability 2.0 - 3.7 (Fermi - Kepler) devices.



Each cycle, each SM warp scheduler picks a warp and issues 1-2 independent instructions.



The event inst_executed is the count of warp instructions that complete. thread_inst_executed is the count of threads that complete an instruction.
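As a rough illustration (not part of the original answer), the ratio of these two counters gives the average number of active threads per executed warp instruction, which is a quick way to spot predication or branch divergence in the raw counters. The sample values below are invented:

```python
# Hypothetical counter values as read from nvprof/CUPTI (made up for illustration).
inst_executed = 1_000_000          # warp instructions completed
thread_inst_executed = 24_000_000  # thread instructions completed

# With 32 threads per warp, a fully converged, unpredicated kernel would give
# thread_inst_executed == 32 * inst_executed.
avg_active_threads = thread_inst_executed / inst_executed
print(f"average active threads per warp instruction: {avg_active_threads:.1f}/32")
```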



If the SM is not able to complete the issued instruction due to

  1. a constant cache miss on an immediate constant (a constant referenced in the instruction),

  2. address divergence in an indexed constant load,

  3. address divergence in a global/local memory load or store,

  4. a bank conflict in a shared memory load or store,

  5. an address conflict in an atomic or reduction operation,

  6. a load or store operation requiring data to be written to the load store unit or read from a unit exceeding the read/write bus width (e.g. a 128-bit load or store), or

  7. a load cache miss (a replay occurs to fetch the data once it is ready in the cache),

then the SM scheduler has to issue the instruction multiple times. This is called an instruction replay. The value inst_issued == inst_issued2 * 2 + inst_issued1 is the number of instructions completed plus the number of instruction replays.
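The issue-counter relationship above can be sketched numerically; the counter values here are invented for illustration:

```python
# Hypothetical counter values (made up for illustration).
inst_issued1 = 1_200_000   # cycles that issued a single instruction
inst_issued2 = 400_000     # cycles that dual-issued two instructions
inst_executed = 1_800_000  # warp instructions that actually completed

# Total issue slots consumed, per the formula in the answer.
inst_issued = inst_issued2 * 2 + inst_issued1

# Anything issued beyond what completed was spent on replays.
replays = inst_issued - inst_executed
print(f"issued={inst_issued}, executed={inst_executed}, replays={replays}")
```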



Instruction replays consume instruction issue slots, reducing the compute throughput of the SM.



The _replay_overhead metrics listed below can help you identify which types of operations are causing replays. The _replay events can provide a magnitude.



NVPROF/CUPTI EVENTS AND METRICS

EVENT GROUP 1 - Generic instruction issue and retire counts

  • inst_executed: Number of instructions executed; does not include replays.

  • inst_issued1: Number of single instructions issued per cycle

  • inst_issued2: Number of dual instructions issued per cycle

  • inst_issued0: Number of cycles that did not issue any instruction; increments per warp.



EVENT GROUP 2 - Replay counts for specific types of events listed above (not all event types have counters)

  • shared_load_replay: Replays caused by a shared load bank conflict (when the addresses of two or more shared memory load requests fall in the same memory bank), or when there is no conflict but the total number of words accessed by all threads in the warp executing the instruction exceeds the number of words that can be loaded in one cycle (256 bytes).

  • shared_store_replay: Replays caused by a shared store bank conflict (when the addresses of two or more shared memory store requests fall in the same memory bank), or when there is no conflict but the total number of words accessed by all threads in the warp executing the instruction exceeds the number of words that can be stored in one cycle.

  • global_ld_mem_divergence_replays: Number of instruction replays for global memory loads. The instruction is replayed if it accesses more than one 128-byte cache line. The counter is incremented by 1 for each extra cache line accessed.

  • global_st_mem_divergence_replays: Number of instruction replays for global memory stores. The instruction is replayed if it accesses more than one 128-byte cache line. The counter is incremented by 1 for each extra cache line accessed.
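A rough model of the two effects counted above — bank conflicts and memory divergence — can be sketched in Python, assuming 32 four-byte-wide banks and 128-byte cache lines as on Fermi/Kepler. This is an illustration of the counting rules, not the hardware's exact algorithm:

```python
def shared_bank_conflicts(byte_addresses, num_banks=32, bank_width=4):
    """Max number of distinct words mapped to one bank across a warp's
    shared-memory request; a value > 1 implies (value - 1) extra passes."""
    banks = {}
    for addr in byte_addresses:
        word = addr // bank_width
        banks.setdefault(word % num_banks, set()).add(word)
    return max(len(words) for words in banks.values())

def mem_divergence_replays(byte_addresses, line_size=128):
    """Replays for a global load/store: one per cache line beyond the first."""
    lines = {addr // line_size for addr in byte_addresses}
    return len(lines) - 1

# A warp striding by 2 words (8 bytes) in shared memory: a 2-way bank conflict.
stride2 = [tid * 8 for tid in range(32)]
print(shared_bank_conflicts(stride2))     # 2

# A warp loading 4-byte values at a 16-byte stride from global memory:
# the 32 accesses span 4 cache lines, so 3 replays.
scattered = [tid * 16 for tid in range(32)]
print(mem_divergence_replays(scattered))  # 3
```

For comparison, a fully coalesced 4-byte-per-thread access (`[tid * 4 for tid in range(32)]`) maps one word to each bank and touches a single cache line, so both counts stay at their minimum.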



METRIC GROUP - Efficiency calculations

  • inst_replay_overhead: Average number of replays for each instruction executed

  • local_replay_overhead: Average number of replays due to local memory accesses for each instruction executed

  • atomic_replay_overhead: Average number of replays due to atomic and reduction bank conflicts for each instruction executed

  • global_replay_overhead: Average number of replays due to global memory cache misses for each instruction executed

  • shared_replay_overhead: Average number of replays due to shared memory conflicts for each instruction executed

  • global_cache_replay_overhead: Average number of replays due to global memory cache misses for each instruction executed



Compute Capability 5.x (Maxwell) devices push replays from the warp scheduler down to the individual units. This reduces replay latency and frees the scheduler to issue math operations. On these devices inst_replay_overhead, derived from inst_issued and inst_executed, will usually be close to 0.



