How do CUDA blocks/warps/threads map onto CUDA cores?

Problem description

I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/threads. I am studying the architecture from a didactic point of view (a university project), so reaching peak performance is not my concern.

First of all, I would like to understand if I got these facts straight:

  1. The programmer writes a kernel and organizes its execution in a grid of thread blocks. (A minimal launch sketch follows this list.)

  2. Each block is assigned to a Streaming Multiprocessor (SM). Once assigned, it cannot migrate to another SM.

  3. Each SM splits its own blocks into warps (currently with a maximum size of 32 threads). All the threads in a warp execute concurrently on the resources of the SM.

  4. The actual execution of a thread is performed by the CUDA cores contained in the SM. There is no specific mapping between threads and cores.

  5. If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run.

  6. On the other hand, if a block contains 48 threads, it will be split into 2 warps and they will execute in parallel provided that enough memory is available.

  7. If a thread starts on a core and is then stalled by a memory access or a long floating-point operation, its execution could resume on a different core.
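To make point 1 concrete, here is a minimal sketch of a kernel whose execution is organized as a grid of thread blocks; the kernel name, data, and launch configuration are made up purely for illustration:

    #include <cuda_runtime.h>

    // Illustrative only: a trivial kernel whose execution is organized as a grid of thread blocks.
    __global__ void scale(float *data, float factor, int n)
    {
        // Global index built from the block index and the thread index within the block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 64 * 128;                  // made-up problem size
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        scale<<<64, 128>>>(d_data, 2.0f, n);     // grid of 64 blocks, 128 threads (4 warps) each
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }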

Are these facts correct?

Now, I have a GeForce 560 Ti, so according to the specifications it is equipped with 8 SMs, each containing 48 CUDA cores (384 cores in total).
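As a side note, the SM count and warp size can be queried at runtime with the CUDA runtime API; a minimal sketch (the cores-per-SM figure is not reported by cudaDeviceProp; it follows from the compute capability, 48 per SM on the GF114 chip of the 560 Ti):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("SMs: %d, warp size: %d, max threads per block: %d\n",
               prop.multiProcessorCount, prop.warpSize, prop.maxThreadsPerBlock);
        return 0;
    }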

My goal is to make sure that every core of the architecture executes the SAME instructions. Assuming that my code will not require more registers than are available in each SM, I imagined different approaches:

  1. I create 8 blocks of 48 threads each, so that each SM has 1 block to execute. In this case, will the 48 threads execute in parallel in the SM (exploiting all 48 cores available to them)?

  2. Is there any difference if I launch 64 blocks of 6 threads? (Assuming that they will be mapped evenly among the SMs.)

  3. If I "submerge" the GPU in work (creating 1024 blocks of 1024 threads each, for example), is it reasonable to assume that all the cores will be used at some point and will perform the same computations (assuming that the threads never stall)? (The three launch configurations above are sketched in code after this list.)

  4. Is there any way to check these situations using the profiler?
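Purely to illustrate the three launch shapes discussed above (myKernel and d_data are hypothetical placeholders; only the <<<grid, block>>> execution configuration is the point):

    // Fragment, not a complete program: only the execution configurations differ.
    myKernel<<<8, 48>>>(d_data);        // point 1: 8 blocks of 48 threads  -> 2 warps per block
    myKernel<<<64, 6>>>(d_data);        // point 2: 64 blocks of 6 threads  -> 1 partial warp per block
    myKernel<<<1024, 1024>>>(d_data);   // point 3: 1024 blocks of 1024 threads
                                        //          (1024 is the per-block limit on compute 2.x)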

Is there any reference for this stuff? I read the CUDA Programming Guide and the chapters dedicated to the hardware architecture in "Programming Massively Parallel Processors" and "CUDA Application Design and Development", but I could not get a precise answer.

Answer

The two best references are

  1. The NVIDIA Fermi Compute Architecture Whitepaper
  2. The GF104 review

I'll try to answer each of your questions.

The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to an SM, the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch them to execution units. For more details on the execution units and instruction dispatch, see reference 1, pages 7-10, and reference 2.

4'. There is a mapping between the lane ID (a thread's index within its warp) and a core.
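A hypothetical kernel like the following makes the warp/lane decomposition visible (it assumes a 1-D block; warpSize is a built-in device variable, 32 on current GPUs):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel: each thread reports which warp and lane it occupies.
    __global__ void who_am_i()
    {
        int warp_id = threadIdx.x / warpSize;   // which warp of the block this thread belongs to
        int lane_id = threadIdx.x % warpSize;   // the thread's position within that warp
        printf("block %d thread %2d -> warp %d lane %2d\n",
               blockIdx.x, threadIdx.x, warp_id, lane_id);
    }

    int main()
    {
        who_am_i<<<1, 48>>>();     // one 48-thread block: warp 0 (lanes 0-31) and warp 1 (lanes 0-15)
        cudaDeviceSynchronize();   // flush device-side printf output
        return 0;
    }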

5'. If a warp contains fewer than 32 threads, it will in most cases be executed just as if it had 32 threads. A warp can have fewer than 32 active threads for several reasons: the number of threads per block is not divisible by 32, the program executes a divergent branch so the threads that did not take the current path are marked inactive, or a thread in the warp has exited.

6'. A thread block will be divided into WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize warps. There is no requirement for the warp schedulers to select two warps from the same thread block.
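As a worked instance of this formula for the 48-thread blocks mentioned in the question: WarpsPerBlock = (48 + 32 - 1) / 32 = 79 / 32 = 2 (integer division), i.e. one full warp of 32 threads plus one partial warp of 16.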

7'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched, the instruction will be dispatched again later, when the resource becomes available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, etc. A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp scheduler can issue an instruction.

See reference 2 for the differences between a GTX 480 and a GTX 560.

If you read the reference material (a few minutes), I think you will find that your goal does not make sense. I'll try to respond to your points.

1'. If you launch kernel<<<8, 48>>> you will get 8 blocks, each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 blocks will be assigned to different SMs. If 2 blocks are allocated to an SM, then it is possible that each warp scheduler can select a warp and execute it. You will only use 32 of the 48 cores.

2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.

  • 8 blocks of 48 threads = 16 warps * 10 instructions = 160 instructions
  • 64 blocks of 6 threads = 64 warps * 10 instructions = 640 instructions

In order to get optimal efficiency, the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.
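A common way to follow this advice is to pick a block size that is a multiple of the warp size and to cover the problem with ceiling division; a small sketch with made-up sizes:

    // Fragment with made-up sizes: warp-aligned block size, grid size by ceiling division.
    const int n = 100000;                                             // hypothetical problem size
    const int threadsPerBlock = 128;                                  // multiple of the 32-thread warp size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division
    // myKernel<<<blocks, threadsPerBlock>>>(d_data, n);              // myKernel and d_data are hypothetical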

3'. A GTX 560 can have 8 SMs * 8 blocks = 64 blocks resident at a time, or 8 SMs * 48 warps = 384 warps, if the kernel does not max out registers or shared memory. At any given time only a portion of that work will be active on the SMs. Each SM has multiple execution units (more than just CUDA cores). Which resources are in use at any given time depends on the warp schedulers and the instruction mix of the application. If you don't do TEX operations, the TEX units will be idle. If you don't do special floating-point operations, the SFU (special function) units will be idle.
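For a sense of scale (my arithmetic, not a figure from the references): 8 SMs * 48 resident warps * 32 threads per warp = 12,288 threads can be resident at once on this GPU even though it has only 384 cores; that oversubscription is what gives the warp schedulers something to issue while other warps are stalled.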

4'. Parallel Nsight and the Visual Profiler show:

  a. executed IPC
  b. issued IPC
  c. active warps per active cycle
  d. eligible warps per active cycle (Nsight only)
  e. warp stall reasons (Nsight only)
  f. active threads per instruction executed

The profilers do not show the utilization percentage of any of the execution units. For the GTX 560 a rough estimate would be IssuedIPC / MaxIPC. For MaxIPC, assume 2 for GF100 (GTX 480) and 4 for GF10x (GTX 560), though 3 is a better target.
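As a purely illustrative calculation: if the profiler reported an issued IPC of 1.5 on the GTX 560, the rough estimate would be 1.5 / 4 ≈ 38% of the theoretical peak issue rate, or 1.5 / 3 = 50% of the more realistic target.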
