What Causes Instruction Replay Overhead in CUDA


Question

I ran the visual profiler on a CUDA application of mine. The application calls a single kernel multiple times if the data is too large. The kernel has no branching.

The profiler reports a high instruction replay overhead of 83.6% and a high global memory instruction replay overhead of 83.5%.
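
For reference, the profiler documentation defines this metric roughly as (instructions issued - instructions executed) / instructions issued, so 83.6% indicates that the large majority of issued instructions were replays of previously issued instructions.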

Here is how the kernel generally looks:

// Decryption kernel
__global__ void dev_decrypt(uint8_t *in_blk, uint8_t *out_blk){
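    // Note: word, b0, tab, tab2 and dev_key are defined elsewhere in the
    // application; b0 appears to alias the shared buffer, and per the answer
    // below, tab is a lookup table held in constant memory.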

    __shared__ volatile word sdata[256];
    register uint32_t data;

    // Thread ID
#define xID (threadIdx.x + blockIdx.x * blockDim.x)
#define yID (threadIdx.y + blockIdx.y * blockDim.y)
    uint32_t tid = xID + yID * blockDim.x * gridDim.x;
#undef xID
#undef yID

    register uint32_t pos4 = tid%4;
    register uint32_t pos256 = tid%256;
    uint32_t blk = pos256&0xFC;

    // Indices
    register uint32_t index0 = blk + (pos4+3)%4;
    register uint32_t index1 = blk + (pos4+2)%4;

    // Read From Global Memory
    b0[pos256] = ((word*)in_blk)[tid+4] ^ dev_key[pos4];

    data  = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    sdata[pos256] = data ^ tab2[pos4];

    data  = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    sdata[pos256] = data ^ tab2[2*pos4];

    data  = tab(0,sdata[index0]);
    data ^= tab(1,sdata[index1]);
    data ^= tab2[3*pos4];

    ((uint32_t*)out_blk)[tid] = data + ((uint32_t*)in_blk)[tid];
}

As you can see, there are no branches. The threads will initially read from global memory based on thread ID + 16 bytes. They will then write to an output buffer after performing an operation with data from global memory based on their thread ID.

Any ideas why this kernel would have so much overhead?

Answer

The source of the instruction replay in this case is non-uniform constant memory access within a warp. In your code, tab is stored in constant memory and indexed according to some combination of the thread index and data stored in shared memory. The result appears to be non-uniform access across the threads within the same warp. Constant memory is really intended for cases where all threads in a warp access the same word; the value can then be broadcast from the constant memory cache in a single operation. Otherwise, warp serialization occurs.
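
To make the uniform-versus-divergent distinction concrete, here is a minimal sketch; the table name c_tab and the kernel are illustrative, not taken from the original code:

#include <stdint.h>

// Small read-only lookup table placed in constant memory
__constant__ uint32_t c_tab[256];

__global__ void const_access_demo(const uint32_t *in, uint32_t *out){

    uint32_t tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Uniform access: every thread in the warp reads the same word, so the
    // constant cache can broadcast it in a single transaction.
    uint32_t broadcast = c_tab[blockIdx.x % 256];

    // Divergent access: the index depends on per-thread data, so threads in
    // the same warp may hit different words and the load is replayed once per
    // distinct address. This serialization shows up as instruction replay
    // overhead in the profiler.
    uint32_t divergent = c_tab[in[tid] & 0xFF];

    out[tid] = broadcast ^ divergent;
}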

In cases where non-uniform access to a small, read-only dataset is required, it would probably be better to bind the data to a texture than to store it in constant memory.
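
A rough sketch of that suggestion follows. It uses the legacy texture-reference API that was current at the time (deprecated in recent toolkits and removed in CUDA 12, where texture objects or __ldg() reads provide the same read-only cached path); the names tex_tab, h_tab and the index computation are illustrative only:

#include <stdint.h>

// 1D texture reference bound to the lookup table (legacy API)
texture<uint32_t, cudaTextureType1D, cudaReadModeElementType> tex_tab;

__global__ void lookup_via_texture(const uint32_t *in_blk, uint32_t *out_blk){

    uint32_t tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Divergent, per-thread indices go through the texture cache instead of
    // the constant cache, so scattered lookups are not serialized the way
    // non-uniform constant memory reads are.
    uint32_t index = in_blk[tid] & 0xFF;       // illustrative index
    out_blk[tid] = tex1Dfetch(tex_tab, index);
}

// Host-side setup: copy the table to device memory and bind it to the texture.
void bind_table(const uint32_t *h_tab){
    uint32_t *d_tab;
    cudaMalloc(&d_tab, 256 * sizeof(uint32_t));
    cudaMemcpy(d_tab, h_tab, 256 * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaBindTexture(NULL, tex_tab, d_tab, 256 * sizeof(uint32_t));
}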

