Why bother to know about CUDA Warps?


Question

I have a GeForce GTX460 SE, so it is: 6 SM x 48 CUDA cores = 288 CUDA cores. It is known that one warp contains 32 threads, and that in one block only one warp can be executed at a time. Does that mean a single multiprocessor (SM) can simultaneously execute only one block, one warp, and therefore only 32 threads, even though 48 cores are available?

In addition, a specific thread and block can be identified using threadIdx.x and blockIdx.x, and they are allocated with the kernel launch kernel<<<Blocks, Threads>>>(). But how can a specific number of warps be allocated and distributed, and if that is not possible, then why bother to know about warps?
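A minimal sketch of what this indexing gives you (the kernel name and launch configuration below are made up for illustration, not from the question): warps cannot be requested in the launch configuration at all; the hardware simply groups every 32 consecutive threads of a block into a warp, so the most a kernel can do is compute which warp and lane a given thread landed in.

#include <cstdio>

// Warps are never allocated explicitly; every 32 consecutive threads of a
// block form one. A thread can only compute, after the fact, where it landed.
__global__ void whoAmI()
{
    int globalThread = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    int warpInBlock  = threadIdx.x / warpSize;                // warp index within the block
    int lane         = threadIdx.x % warpSize;                // position within the warp

    if (lane == 0)  // print once per warp
        printf("block %d, warp %d, first global thread %d\n",
               blockIdx.x, warpInBlock, globalThread);
}

int main()
{
    whoAmI<<<2, 96>>>();        // 2 blocks of 96 threads = 3 warps per block
    cudaDeviceSynchronize();    // wait for the kernel (and its printf output)
    return 0;
}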

Answer

The situation is quite a bit more complicated than what you describe.

The ALUs (cores), load/store (LD/ST) units and Special Function Units (SFU) (green in the image) are pipelined units. They keep the results of many computations or operations at the same time, in various stages of completion. So, in one cycle they can accept a new operation and provide the results of another operation that was started a long time ago (around 20 cycles for the ALUs, if I remember correctly). So, a single SM in theory has resources for processing 48 * 20 cycles = 960 ALU operations at the same time, which is 960 / 32 threads per warp = 30 warps. In addition, it can process LD/ST operations and SFU operations at whatever their latency and throughput are.

The warp schedulers (yellow in the image) can schedule 2 * 32 threads per warp = 64 threads to the pipelines per cycle. So that's the number of results that can be obtained per clock. Given that there is a mix of computing resources, 48 cores, 16 LD/ST units, and 8 SFUs, each of which has a different latency, a mix of warps is being processed at the same time. At any given cycle, the warp schedulers try to "pair up" two warps to schedule, to maximize the utilization of the SM.

The warp schedulers can issue warps either from different blocks, or from different places in the same block, if the instructions are independent. So, warps from multiple blocks can be processed at the same time.

Adding to the complexity, warps that are executing instructions for which there are fewer than 32 resources must be issued multiple times for all the threads to be serviced. For instance, there are 8 SFUs, so that means that a warp containing an instruction that requires the SFUs must be scheduled 4 times.

This description is simplified. There are other restrictions that come into play as well that determine how the GPU schedules the work. You can find more information by searching the web for "fermi architecture".

So, coming to your actual question,

why bother to know about Warps?

Knowing the number of threads in a warp and taking it into consideration becomes important when you try to maximize the performance of your algorithm. If you don't follow these rules, you lose performance:

  • In the kernel invocation, <<<Blocks, Threads>>>, try to choose a number of threads that is a multiple of the number of threads in a warp (32). If you don't, you end up launching a block that contains inactive threads (see the launch-configuration sketch after this list).

  • In your kernel, try to have each thread in a warp follow the same code path. If you don't, you get what's called warp divergence. This happens because the GPU has to run the entire warp through each of the divergent code paths (see the divergence sketch after this list).

  • In your kernel, try to have each thread in a warp load and store data in specific patterns. For instance, have the threads in a warp access consecutive 32-bit words in global memory (see the coalescing sketch after this list).
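
For the first rule, a minimal launch-configuration sketch (the kernel, array size, and block size below are arbitrary examples, not from the original post): pick a block size that is a multiple of 32 and round the grid size up, letting the few surplus threads in the last block exit early.

#include <cuda_runtime.h>

// Each thread handles one element and exits if its index is past the end.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // surplus threads in the last block do nothing
        data[i] *= 2.0f;
}

int main()
{
    const int n = 100000;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    const int blockSize = 256;                              // 256 = 8 full warps
    const int gridSize  = (n + blockSize - 1) / blockSize;  // round up to cover n
    scaleKernel<<<gridSize, blockSize>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}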
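
For the second rule, a sketch of what divergence looks like (both kernels are hypothetical): in the first kernel, even and odd lanes of the same warp take different branches, so the warp is run through both paths; in the second, the branch depends only on the warp index, so every thread of a warp agrees and no divergence occurs.

// Divergent: lanes within one warp disagree on the branch, so the hardware
// executes both paths for that warp, masking off the inactive lanes.
__global__ void divergentKernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = -1.0f;
}

// Uniform per warp: the condition is the same for all 32 lanes of a warp,
// so each warp follows a single path and no divergence occurs.
__global__ void uniformKernel(float *out)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;
    if (warp % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = -1.0f;
}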
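
And for the third rule, a sketch of the access-pattern difference (again hypothetical kernels): when thread k of a warp reads 32-bit word k, the 32 loads coalesce into a few wide memory transactions; with a large stride each load tends to touch its own memory segment and far more traffic is generated.

// Coalesced: consecutive threads read consecutive 32-bit words.
__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads read words far apart, defeating coalescing.
__global__ void stridedCopy(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];
}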

