How to understand "All threads in a warp execute the same instruction at the same time." in GPU?


Question

I am reading Professional CUDA C Programming, and in GPU Architecture Overview section:


CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. All threads in a warp execute the same instruction at the same time. Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data. Each SM partitions the thread blocks assigned to it into 32-thread warps that it then schedules for execution on available hardware resources.

The SIMT architecture is similar to the SIMD (Single Instruction, Multiple Data) architecture. Both SIMD and SIMT implement parallelism by broadcasting the same instruction to multiple execution units. A key difference is that SIMD requires that all vector elements in a vector execute together in a unified synchronous group, whereas SIMT allows multiple threads in the same warp to execute independently. Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior. SIMT enables you to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. The SIMT model includes three key features that SIMD does not:
➤ Each thread has its own instruction address counter.
➤ Each thread has its own register state.
➤ Each thread can have an independent execution path.

The first paragraph says "All threads in a warp execute the same instruction at the same time," while the second paragraph says "Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior." This confuses me; the two statements seem contradictory. Could anyone explain?

Answer

There is no contradiction. All threads in a warp execute the same instruction in lock-step at all times. To support conditional execution and branching, CUDA introduces two concepts in the SIMT model:

  1. Predicated execution (See here)
  2. Instruction replay/serialisation (See here)

Predicated execution means that the result of a conditional instruction can be used to mask off threads from executing a subsequent instruction without a branch. Instruction replay is how a classic conditional branch is dealt with. All threads execute all branches of the conditionally executed code by replaying instructions. Threads which do not follow a particular execution path are masked off and execute the equivalent of a NOP. This is the so-called branch divergence penalty in CUDA, because it has a significant impact on performance.
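To make the divergence concrete, here is a minimal sketch (not from the original answer; the kernel name is made up) of a branch that splits every warp: even-numbered lanes take one path and odd-numbered lanes the other, so the hardware executes both branches back to back, masking off the inactive lanes each time.

```cuda
// Sketch: a branch that diverges *within* every warp.
// threadIdx.x % 2 alternates between neighboring lanes, so half of each
// warp is masked off while the other half executes, and vice versa.
__global__ void divergentKernel(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.0f;
    if (threadIdx.x % 2 == 0) {
        v = 100.0f;   // odd lanes sit this out (effectively a NOP)
    } else {
        v = 200.0f;   // even lanes sit this out
    }
    out[tid] = v;
}
```

Both assignments are issued to the whole warp; the active mask decides which lanes actually commit a result, which is exactly the replay/masking behavior described above.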

This is how lock-step execution can support branching.
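For contrast, a common way to avoid the penalty (again a sketch with a hypothetical kernel name, not part of the original answer) is to make the branch condition a function of the warp index rather than the lane index, so every thread in a given warp takes the same path and neither branch needs to be replayed:

```cuda
// Sketch: the same two-way split, but aligned to warp granularity.
// (tid / warpSize) is constant across all 32 lanes of a warp,
// so no warp diverges and each warp executes only one branch.
__global__ void alignedKernel(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.0f;
    if ((tid / warpSize) % 2 == 0) {
        v = 100.0f;   // the entire warp takes this path
    } else {
        v = 200.0f;   // the entire warp takes this path
    }
    out[tid] = v;
}
```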
