Other reasons for instruction replays in CUDA

Question

This is the output I get from nvprof (CUDA 5.5):

Invocations                 Metric Name              Metric Description         Min         Max         Avg
Device "Tesla K40c (0)"
Kernel: MyKernel(double const *, double const *, double*, int, int, int)
     60            inst_replay_overhead     Instruction Replay Overhead    0.736643    0.925197    0.817188
     60          shared_replay_overhead   Shared Memory Replay Overhead    0.000000    0.000000    0.000000
     60          global_replay_overhead   Global Memory Replay Overhead    0.108972    0.108972    0.108972
     60    global_cache_replay_overhead  Global Memory Cache Replay Ove    0.000000    0.000000    0.000000
     60           local_replay_overhead  Local Memory Cache Replay Over    0.000000    0.000000    0.000000
     60                gld_transactions        Global Load Transactions       25000       25000       25000
     60                gst_transactions       Global Store Transactions       75000       75000       75000
     60  warp_nonpred_execution_efficie  Warp Non-Predicated Execution       99.63%      99.63%      99.63%
     60                       cf_issued  Issued Control-Flow Instructio       44911       45265       45101
     60                     cf_executed  Executed Control-Flow Instruct       39533       39533       39533
     60                     ldst_issued  Issued Load/Store Instructions      273117      353930      313341
     60                   ldst_executed  Executed Load/Store Instructio       50016       50016       50016
     60              stall_data_request  Issue Stall Reasons (Data Requ      65.21%      68.93%      67.86%
     60                   inst_executed           Instructions Executed      458686      458686      458686
     60                     inst_issued             Instructions Issued      789220      879145      837129
     60                     issue_slots                     Issue Slots      716816      803393      759614

The kernel uses 356 bytes of cmem[0] and no shared memory, and there are no register spills. My question is: what causes the instruction replays in this case? We see an overhead of 81%, but the numbers do not add up.
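To make "the numbers do not add up" concrete, here is a small worked calculation using only the Avg column of the nvprof output above. It assumes the comparison being made is between the total replay overhead and the sum of the four per-category overheads; that reading is an interpretation, not something the post states.

```latex
\begin{align*}
\text{shared} + \text{global} + \text{global cache} + \text{local}
  &= 0.000000 + 0.108972 + 0.000000 + 0.000000 = 0.108972 \\
\text{inst\_replay\_overhead} - 0.108972
  &= 0.817188 - 0.108972 = 0.708216 \\
\frac{\text{ldst\_issued}}{\text{ldst\_executed}}
  &= \frac{313341}{50016} \approx 6.26
\end{align*}
```

On this reading, about 0.71 of the 0.82 overhead is not attributed to any of the four listed memory categories, while load/store instructions are issued roughly six times for each one executed.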

Thanks!

Recommended answer

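Some possible reasons:

  1. Shared memory bank conflicts (which you don't have)

  2. Constant memory conflicts (i.e., different threads in a warp requesting different locations in constant memory from the same instruction)

  3. Warp-divergent code (if..then..else taking different paths for different threads in a warp)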
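Reasons 2 and 3 in the list above can be illustrated with a minimal sketch. The kernels, the `coeff` table, and `runReplayDemo` are hypothetical names invented for this example; they are not taken from `MyKernel`, whose source is not shown. How much replay overhead each kernel actually reports depends on the GPU architecture and compiler, so treat this as something to profile rather than a guaranteed reproduction.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical 32-entry table, used only for this illustration.
__constant__ double coeff[32];

// Every thread in a warp reads the same constant address (uniform access),
// which the constant cache can serve as a single broadcast.
__global__ void constUniform(double *out, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = coeff[k];
}

// Each lane reads a different constant address from the same instruction,
// which is the access pattern described in reason 2.
__global__ void constPerLane(double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = coeff[threadIdx.x % 32];
}

// Odd and even lanes want different paths (reason 3). For a branch this short
// the compiler may emit predicated instructions instead of a real branch.
__global__ void divergentBranch(double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x & 1) out[i] = sqrt((double)i);
    else                 out[i] = 0.5 * (double)i;
}

// Launches the three kernels and spot-checks their results on the host.
bool runReplayDemo()
{
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(double);

    double h_coeff[32];
    for (int j = 0; j < 32; ++j) h_coeff[j] = 1.0 + j;
    if (cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff)) != cudaSuccess) return false;

    double *d_out = NULL;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return false;
    std::vector<double> h_out(n);
    bool ok = true;

    constUniform<<<blocks, threads>>>(d_out, n, 3);
    ok = ok && cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && h_out[0] == 4.0 && h_out[n - 1] == 4.0;   // coeff[3] everywhere

    constPerLane<<<blocks, threads>>>(d_out, n);
    ok = ok && cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && h_out[0] == 1.0 && h_out[33] == 2.0;      // coeff[0], coeff[1]

    divergentBranch<<<blocks, threads>>>(d_out, n);
    ok = ok && cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && h_out[4] == 2.0 && h_out[9] == 3.0;       // 0.5*4, sqrt(9)

    ok = ok && cudaGetLastError() == cudaSuccess;
    cudaFree(d_out);
    return ok;
}

int main()
{
    printf("%s\n", runReplayDemo() ? "results OK" : "CUDA error or wrong result");
    return 0;
}
```

One way to compare the kernels, assuming a toolkit and GPU that still support nvprof, is to build for the device's architecture (for example `nvcc -arch=sm_35` on a Tesla K40c) and run the binary under `nvprof --metrics inst_replay_overhead`, then read the metric per kernel.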

This presentation may be of interest, especially slides 8-11.
