CUDA thread scheduling - latency hiding

Question

When is a CUDA thread (or a whole warp) that performs a read from global memory put to sleep by the scheduler? Suppose that, right after the memory read, my kernel does some computation that does not depend on the data being read. Can that computation execute while the data from the global read has not yet arrived?

Recommended answer

A memory read by itself does not cause a stall (barring the cases where the LD/ST unit is unavailable).

The thread stall will occur when the result of that memory read operation needs to be used by another operation.

The compiler is aware of this and will attempt to reorder independent (SASS) instructions so that a read will be followed by independent instructions.

However, once code is compiled, the instruction sequence is not altered (CUDA GPUs currently do not perform speculative execution or out-of-order execution). So once the operation that depends on the read occurs in the (SASS) instruction stream, that thread will stall until the read operation is complete. (1)

Therefore if you did something like this:

float a = global_data[idx];
float b = c*d;
a = a*b;

Then line 1 of the above code will not cause a thread stall. Line 2 will not cause a stall assuming c and d are ready/available. Line 3 will cause a stall if the value of a has not been retrieved from global memory by the time that line is encountered. (Since it also depends on b, there will probably be some arithmetic latency -- possibly a stall -- while b is passing through the multiply pipe, but this arithmetic latency may be much shorter than global memory latency.)
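To make the three lines above concrete, here is a minimal sketch that wraps them in a complete kernel with a small host driver. The names `scale_kernel`, `run_scale`, `out`, and `n`, and the choice to pass `c` and `d` as kernel arguments, are assumptions made for illustration and are not from the original answer. The comments describe where a stall can occur according to the reasoning above; the compiler remains free to reorder the independent instructions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel wrapping the three lines discussed above.
// scale_kernel, out, and n are hypothetical names for this sketch.
__global__ void scale_kernel(const float* global_data, float* out,
                             float c, float d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float a = global_data[idx]; // line 1: load is issued, no stall by itself
        float b = c * d;            // line 2: independent of the load
        a = a * b;                  // line 3: first use of a; may stall here
        out[idx] = a;
    }
}

// Host helper: runs the kernel on n elements and checks every result.
bool run_scale(int n)
{
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_in, d_out, 2.0f, 3.0f, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    // c * d = 6 and the inputs are small integers, so the products are exact.
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i] * 6.0f) return false;
    return true;
}

int main()
{
    std::printf("%s\n", run_scale(1024) ? "results match" : "mismatch or CUDA error");
    return 0;
}
```

The driver only checks correctness; it does not measure stalls. Observing where a warp actually waits would require a profiler or an inspection of the generated SASS.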

As already mentioned, even if you don't write code this way, the compiler will generally attempt to re-order independent operations such that the situation is more favorable. For example if you wrote the code this way:

float b = c*d;
float a = global_data[idx];
a = a*b;

it's quite possible the underlying SASS code might not be significantly different. Even if you do something like this:

float b = c*d;
float a = global_data[idx]*b;

the compiler will break the second line of code into (at least) two separate operations: the load of global_data[idx] into a register, followed by a multiply operation. Again, the underlying SASS code in any of these realizations may not be substantially different.
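One way to check this claim on a particular toolchain is to compile all three source orderings and compare their SASS. The sketch below is only an assumption about how such an experiment could be set up: the kernel names (`load_first`, `multiply_first`, `fused_line`), the helper functions, and the file name `variants.cu` are hypothetical. The host code confirms that the three variants compute identical results; whether the SASS is identical depends on the compiler version and target architecture.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Three hypothetical kernels, one per source ordering discussed above.
__global__ void load_first(const float* global_data, float* out, float c, float d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float a = global_data[idx];
    float b = c * d;
    a = a * b;
    out[idx] = a;
}

__global__ void multiply_first(const float* global_data, float* out, float c, float d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float b = c * d;
    float a = global_data[idx];
    a = a * b;
    out[idx] = a;
}

__global__ void fused_line(const float* global_data, float* out, float c, float d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float b = c * d;
    float a = global_data[idx] * b; // load and multiply written on one line
    out[idx] = a;
}

using Kernel = void (*)(const float*, float*, float, float, int);

// Runs one variant on n elements; returns an empty vector on a CUDA error.
std::vector<float> run_variant(Kernel kernel, int n)
{
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = 0.5f * i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return {};
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return {};
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, 1.5f, 4.0f, n);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return {};
    return h_out;
}

// True when all three variants ran and produced identical outputs (n > 0).
bool variants_agree(int n)
{
    std::vector<float> r1 = run_variant(load_first, n);
    std::vector<float> r2 = run_variant(multiply_first, n);
    std::vector<float> r3 = run_variant(fused_line, n);
    return !r1.empty() && r1 == r2 && r2 == r3;
}

int main()
{
    std::printf("%s\n", variants_agree(4096) ? "variants agree" : "mismatch or CUDA error");
    // To compare the generated SASS, assuming this file is saved as variants.cu:
    //   nvcc -o variants variants.cu
    //   cuobjdump -sass variants
    return 0;
}
```

In the `cuobjdump -sass` listing, the position of the global load instruction relative to the multiply instructions in each kernel shows how the compiler actually scheduled them.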

(1) Fermi cc2.1 and cc3.x and higher SMs generally have the capability for multiple issue, i.e. superscalar operation. This means that multiple (independent SASS) instructions from the same instruction stream, for the same warp, can be scheduled in the same cycle, subject to resource limits and restrictions. I don't consider such multiple-issue cases to contradict the statements about speculative or out-of-order (OOO) execution, and I don't consider them to materially affect the discussion above. Once a thread has stalled, i.e. the opportunity to issue instructions within the confines of the instruction scheduler mechanism has "dried up", no further instructions can or will be scheduled until the stall is removed. Low-level details of the capabilities and limitations of the multiple-issue mechanism are unpublished, as far as I know.
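As a sketch of what "independent instructions from the same instruction stream" can look like at the source level, the hypothetical kernel below issues two global loads that do not depend on each other before either value is used. The names `sum_two`, `run_sum_two`, `x`, `y`, and `out` are assumptions for illustration. Under the in-order model described above, both loads can be in flight at once, and the warp waits only at the first instruction that consumes a loaded value. Nothing in the source guarantees that any pair of instructions is issued in the same cycle; that decision belongs to the compiler and the hardware.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: two independent loads followed by their first use.
__global__ void sum_two(const float* x, const float* y, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float a = x[idx];  // load issued; does not stall by itself
        float b = y[idx];  // independent of a, so it can be issued while a is in flight
        out[idx] = a + b;  // first use: waits here until both values have arrived
    }
}

// Host helper: runs the kernel on n elements and checks every result.
bool run_sum_two(int n)
{
    std::vector<float> h_x(n), h_y(n), h_out(n);
    for (int i = 0; i < n; ++i) {
        h_x[i] = static_cast<float>(i);
        h_y[i] = static_cast<float>(2 * i);
    }

    const size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_x, bytes) == cudaSuccess &&
              cudaMalloc(&d_y, bytes) == cudaSuccess &&
              cudaMalloc(&d_out, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y.data(), bytes, cudaMemcpyHostToDevice);

        sum_two<<<(n + 255) / 256, 256>>>(d_x, d_y, d_out, n);

        // cudaMemcpy on the default stream waits for the kernel to finish.
        ok = cudaMemcpy(h_out.data(), d_out, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // cudaFree accepts a null pointer, so this is safe after a failed allocation.
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_out);
    if (!ok) return false;

    // Inputs are small integers, so i + 2i is exact in float.
    for (int i = 0; i < n; ++i)
        if (h_out[i] != static_cast<float>(3 * i)) return false;
    return true;
}

int main()
{
    std::printf("%s\n", run_sum_two(2048) ? "results match" : "mismatch or CUDA error");
    return 0;
}
```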

Slide 14 here may be of interest.
