What is the context switching mechanism in GPUs?

Question

As I understand it, GPUs switch between warps to hide memory latency. But I wonder under what conditions a warp gets switched out. For example, if a warp performs a load and the data is already in the cache, is the warp switched out, or does it continue with the next computation? What happens if there are two consecutive adds? Thanks.

Answer

First of all, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until they all exit the kernel. Thus a block is not launched until there are sufficient registers for all warps of the block, and until there is enough free shared memory for the block.
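
If you want to see how these resource limits play out for a specific kernel, the CUDA runtime can report how many blocks of that kernel fit on one SM at a time. Below is a minimal sketch using `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; the kernel `myKernel` and the block size of 256 are placeholders chosen purely for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only so the occupancy query has something
// to inspect; its register/shared-memory usage determines the answer.
__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main()
{
    int blockSize = 256;     // assumed launch configuration
    size_t dynamicSmem = 0;  // no dynamic shared memory in this sketch

    // Ask the runtime how many blocks of this kernel can be resident
    // on a single SM at once, given its resource usage.
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, myKernel, blockSize, dynamicSmem);

    printf("Resident blocks per SM: %d\n", maxBlocksPerSM);
    return 0;
}
```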

So warps are never "switched out" -- there is no inter-warp context switching in the traditional sense of the word, where a context switch requires saving registers to memory and restoring them.

The SM does, however, choose instructions to issue from among all resident warps. In fact, the SM is more likely to issue two instructions in a row from different warps than from the same warp, no matter what type of instruction they are, regardless of how much ILP (instruction-level parallelism) there is. Not doing so would expose the SM to dependency stalls. Even "fast" instructions like adds have a non-zero latency, because the arithmetic pipeline is multiple cycles long. On Fermi, for example, the hardware can issue 2 or more warp-instructions per cycle (peak), and the arithmetic pipeline latency is ~12 cycles. Therefore you need multiple warps in flight just to hide arithmetic latency, not just memory latency.
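
To make the dependency-stall point concrete, here is a toy kernel sketch (purely illustrative): each add below reads the result of the previous one, so within a single warp the adds cannot overlap in the arithmetic pipeline, and the SM fills the idle issue slots with instructions from other resident warps.

```cuda
// Toy kernel: a chain of dependent adds. One warp alone must wait
// roughly a full pipeline latency between each issue; the SM hides
// that latency by issuing from *other* resident warps in the
// meantime -- not by "switching out" this one.
__global__ void dependentAdds(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];
    x = x + 1.0f;  // each add depends on the previous result,
    x = x + 2.0f;  // so these cannot be pipelined back to back
    x = x + 3.0f;  // within a single warp
    out[i] = x;
}
```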

In general, the details of warp scheduling are architecture dependent, not publicly documented, and pretty much guaranteed to change over time. The CUDA programming model is independent of the scheduling algorithm, and you should not rely on it in your software.
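
In practice that means: if correctness depends on one warp's results being visible to another, use the explicit synchronization primitives the programming model does guarantee, rather than assuming anything about issue order. Here is a minimal block-reduction sketch along those lines (assuming a block size of 256; the kernel is illustrative, not from the original answer).

```cuda
// Hypothetical block-sum reduction: ordering between warps is made
// explicit with __syncthreads() instead of relying on any particular
// warp-scheduling behavior.
__global__ void blockSum(float *out, const float *in)
{
    __shared__ float smem[256];  // assumes blockDim.x == 256
    int tid = threadIdx.x;
    smem[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();             // all warps have written smem

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            smem[tid] += smem[tid + s];
        __syncthreads();         // do not rely on warp issue order
    }
    if (tid == 0)
        out[blockIdx.x] = smem[0];
}
```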
