Why bother to know about CUDA Warps?


Problem description




I have a GeForce GTX460 SE, so that is: 6 SM x 48 CUDA Cores = 288 CUDA Cores. It is known that one warp contains 32 threads, and that within one block only one warp can be executed at a time. Does that mean that a single multiprocessor (SM) can simultaneously execute only one block, one warp and therefore only 32 threads, even though 48 cores are available?

In addition, specific threads and blocks can be addressed using threadIdx.x and blockIdx.x, and they are allocated with kernel <<< Blocks, Threads >>> (). But how can a specific number of warps be allocated and distributed, and if that is not possible, then why bother to know about warps?

Solution

The situation is quite a bit more complicated than what you describe.

The ALUs (cores), load/store (LD/ST) units and Special Function Units (SFU) (green in the Fermi SM diagram) are pipelined units. They hold many computations or operations in flight at the same time, in various stages of completion. So, in one cycle they can accept a new operation and deliver the result of another operation that was started long before (around 20 cycles earlier for the ALUs, if I remember correctly). So, a single SM in theory has the resources to process 48 * 20 cycles = 960 ALU operations at the same time, which is 960 / 32 threads per warp = 30 warps. In addition, it can process LD/ST operations and SFU operations at whatever their latency and throughput are.
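As an aside, the per-SM numbers used in this arithmetic can be read back from the card itself. A minimal sketch using the CUDA runtime's device-property query (device 0 assumed; the number of cores per SM is not reported by this API, so it is not printed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("Device                : %s\n", prop.name);
        printf("Multiprocessors (SM)  : %d\n", prop.multiProcessorCount);
        printf("Warp size             : %d threads\n", prop.warpSize);
        printf("Max threads per SM    : %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Max threads per block : %d\n", prop.maxThreadsPerBlock);
        return 0;
    }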

The warp schedulers (yellow in the same diagram) can schedule 2 warps * 32 threads per warp = 64 threads to the pipelines per cycle, so that is the number of results that can be obtained per clock. Given that the computing resources are a mix of 48 cores, 16 LD/ST units and 8 SFUs, each with a different latency, a mix of warps is being processed at the same time. In any given cycle, the warp schedulers try to "pair up" two warps to schedule, to maximize the utilization of the SM.

The warp schedulers can issue warps either from different blocks, or from different places in the same block, if the instructions are independent. So, warps from multiple blocks can be processed at the same time.
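Note that warps are never requested explicitly in the launch configuration; the hardware carves each block into warps of 32 consecutive threads. A small sketch of how a thread can work out which warp and lane it landed in (the kernel and array names below are made up for illustration; 1-D blocks assumed):

    __global__ void tagWarps(int *warpIdOut, int *laneIdOut)
    {
        int globalId    = blockIdx.x * blockDim.x + threadIdx.x;
        int warpInBlock = threadIdx.x / warpSize;   // warp index within this block
        int lane        = threadIdx.x % warpSize;   // 0..31, position inside the warp
        warpIdOut[globalId] = warpInBlock;
        laneIdOut[globalId] = lane;
    }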

Adding to the complexity, warps that are executing instructions for which there are fewer than 32 execution units must be issued multiple times for all the threads to be serviced. For instance, there are only 8 SFUs, so a warp containing an instruction that requires the SFU must be scheduled 4 times.

This description is simplified. There are other restrictions that come into play as well that determine how the GPU schedules the work. You can find more information by searching the web for "fermi architecture".

So, coming to your actual question,

why bother to know about Warps?

Knowing the number of threads in a warp and taking it into consideration becomes important when you try to maximize the performance of your algorithm. If you don't follow these rules, you lose performance:

  • In the kernel invocation, <<<Blocks, Threads>>>, try to choose a number of threads per block that is a multiple of the number of threads in a warp. If you don't, you end up launching a block that contains inactive threads (see the first sketch after this list).

  • In your kernel, try to have each thread in a warp follow the same code path. If you don't, you get what's called warp divergence. This happens because the GPU has to run the entire warp through each of the divergent code paths (second sketch below).

  • In your kernel, try to have the threads in a warp load and store data in specific patterns. For instance, have the threads in a warp access consecutive 32-bit words in global memory (third sketch below).
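For the first point, a minimal sketch (kernel name, helper function and sizes are made up for illustration): the block size is a whole number of warps, the grid size is rounded up, and a bounds check guards the last, partially filled block.

    #include <cuda_runtime.h>

    // Hypothetical kernel: the bounds check lets the grid be rounded up
    // without touching memory past the end of the array.
    __global__ void scaleKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // threads past the end simply do nothing
            data[i] *= 2.0f;
    }

    void launchScale(float *d_data, int n)
    {
        const int threadsPerBlock = 256;                                 // 256 = 8 full warps
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
        scaleKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    }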
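For the second point, two hypothetical kernels (not from the original answer) contrasting a branch that splits every warp with one that is uniform across a block:

    // Divergent: odd and even lanes of the same warp take different branches,
    // so each warp has to run both paths one after the other.
    __global__ void divergentKernel(float *out, const float *a, const float *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (threadIdx.x % 2 == 0)       // splits every warp in half
                out[i] = a[i] + b[i];
            else
                out[i] = a[i] - b[i];
        }
    }

    // Uniform: every thread of a block (and hence of a warp) takes the same
    // branch, so no serialization of code paths is needed.
    __global__ void uniformKernel(float *out, const float *a, const float *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (blockIdx.x % 2 == 0)        // same branch for the whole block
                out[i] = a[i] + b[i];
            else
                out[i] = a[i] - b[i];
        }
    }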
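For the third point, two hypothetical kernels contrasting a coalesced access pattern with a strided one:

    // Coalesced: thread k of a warp reads word k, so the 32 reads fall into
    // consecutive 32-bit words and can be combined into a few wide
    // memory transactions.
    __global__ void coalescedCopy(float *dst, const float *src, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            dst[i] = src[i];
    }

    // Strided: neighbouring threads touch words that are far apart, so each
    // warp needs many separate memory transactions.
    __global__ void stridedCopy(float *dst, const float *src, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            dst[i] = src[i * stride];
    }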
