What is the context switching mechanism in GPU?


Question

As far as I know, GPUs switch between warps to hide memory latency. But I wonder under what conditions a warp gets switched out. For example, if a warp performs a load and the data is already in the cache, is the warp switched out, or does it continue with the next computation? What happens if there are two consecutive adds? Thanks
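
For concreteness, here is a minimal kernel sketch of the scenario the question describes (the kernel and its names are hypothetical, not from the original post): a load followed by two consecutive dependent adds.

```cuda
// Hypothetical kernel matching the question's scenario: a load that may
// hit in cache, followed by two consecutive adds that depend on it.
__global__ void load_then_add(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];    // the load -- may or may not hit in cache
        x = x + 1.0f;       // first add, depends on the load
        x = x + 2.0f;       // second add, depends on the first
        out[i] = x;
    }
}
```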

Answer

First of all, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until they all exit the kernel. Thus a block is not launched until there are sufficient registers for all warps of the block, and until there is enough free shared memory for the block.
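
As a side note, this resource requirement can be inspected programmatically. The sketch below assumes the CUDA runtime occupancy API (`cudaOccupancyMaxActiveBlocksPerMultiprocessor`, added in CUDA 6.5, so later than this Fermi-era answer); `my_kernel` and the launch parameters are placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data)
{
    // Placeholder body; the real kernel's register and shared-memory
    // usage is what limits how many of its blocks fit on one SM.
    if (data) data[threadIdx.x] = 0.0f;
}

int main()
{
    int max_blocks_per_sm = 0;
    // Asks the runtime how many blocks of my_kernel, at 256 threads per
    // block and 0 bytes of dynamic shared memory, can be resident on a
    // single multiprocessor at the same time.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, my_kernel, /*blockSize=*/256,
        /*dynamicSMemSize=*/0);
    printf("Resident blocks per SM: %d\n", max_blocks_per_sm);
    return 0;
}
```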

So warps are never "switched out" -- there is no inter-warp context switching in the traditional sense of the word, where a context switch requires saving registers to memory and restoring them.

The SM does, however, choose instructions to issue from among all resident warps. In fact, the SM is more likely to issue two instructions in a row from different warps than from the same warp, no matter what type of instruction they are, regardless of how much ILP (instruction-level parallelism) there is. Not doing so would expose the SM to dependency stalls. Even "fast" instructions like adds have a non-zero latency, because the arithmetic pipeline is multiple cycles long. On Fermi, for example, the hardware can issue 2 or more warp-instructions per cycle (peak), and the arithmetic pipeline latency is ~12 cycles. Therefore you need multiple warps in flight just to hide arithmetic latency, not just memory latency.
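
As a rough back-of-envelope sketch using the Fermi numbers quoted above (illustrative only; real figures vary by architecture and by how much ILP each warp exposes):

```cuda
// Estimate of warps needed to hide arithmetic latency (Fermi numbers
// from the answer above; illustrative, not exact for any real chip):
//
//   warp-instructions in flight >= issue rate x pipeline latency
//                                = 2 per cycle x ~12 cycles
//                                = ~24
//
// With no ILP (a fully dependent chain, as below), each warp contributes
// only one in-flight instruction, so roughly 24 resident warps per SM
// are needed just to keep the arithmetic pipeline busy.
__global__ void dependent_adds(float* data)
{
    float x = data[threadIdx.x];
    x = x + 1.0f;  // each add waits ~12 cycles for the previous one,
    x = x + 1.0f;  // so this warp alone cannot fill the pipeline;
    x = x + 1.0f;  // other resident warps must issue in the gaps
    data[threadIdx.x] = x;
}
```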

In general, the details of warp scheduling are architecture dependent, not publicly documented, and pretty much guaranteed to change over time. The CUDA programming model is independent of the scheduling algorithm, and you should not rely on it in your software.
