CUDA - Multiprocessors, Warp Size and Maximum Threads Per Block: What is the exact relationship?

Question

I know that a CUDA GPU contains multiprocessors, which in turn contain CUDA cores. At my workplace I am working with a GTX 590, which contains 512 CUDA cores and 16 multiprocessors, and which has a warp size of 32. So this means there are 32 CUDA cores in each multiprocessor, which work on exactly the same code in the same warp. And finally, the maximum number of threads per block is 1024.

My question is how block size, multiprocessor count, and warp size are exactly related. Let me state my understanding of the situation: for example, I allocate N blocks with the maximum threads-per-block size of 1024 on the GTX 590. As far as I understand from the CUDA programming guide and from other sources, the blocks are first enumerated by the hardware. In this case, 16 of the N blocks are assigned to different multiprocessors. Each block contains 1024 threads, and the hardware scheduler assigns 32 of these threads to the 32 cores in a single multiprocessor. The threads in the same multiprocessor (warp) process the same line of code and use the shared memory of the current multiprocessor. If the current 32 threads encounter an off-chip operation like a memory read or write, they are replaced with another group of 32 threads from the current block. So, there are actually only 32 threads in a single block running exactly in parallel on a multiprocessor at any given time, not the whole 1024. Finally, if a block is completely processed by a multiprocessor, a new thread block from the list of N thread blocks is plugged into the current multiprocessor. So in total there are 512 threads running in parallel in the GPU during the execution of the CUDA kernel. (I know that if a block uses more registers than are available on a single multiprocessor, then it is divided to work on two multiprocessors, but let's assume that each block fits into a single multiprocessor in our case.)
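The launch configuration described above can be written down concretely. This is a minimal sketch, not code from the question: the kernel body, the value of N, and the buffer contents are all placeholders.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each of the N * 1024 threads touches one element.
__global__ void myKernel(float *data)
{
    // Global thread index: block index * block size + index within the block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;   // trivial per-thread work
}

int main()
{
    const int N = 64;                  // number of blocks (placeholder value)
    const int threadsPerBlock = 1024;  // the per-block maximum on this hardware
    float *d_data;
    cudaMalloc(&d_data, N * threadsPerBlock * sizeof(float));
    // Each block runs entirely on one SM and is split into 1024 / 32 = 32 warps.
    myKernel<<<N, threadsPerBlock>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```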

So, is my model of CUDA parallel execution correct? If not, what is wrong or missing? I want to fine-tune the current project I am working on, so I need the most accurate working model of the whole thing.

Answer

At my workplace I am working with a GTX 590, which contains 512 CUDA cores and 16 multiprocessors, and which has a warp size of 32. So this means there are 32 CUDA cores in each multiprocessor, which work on exactly the same code in the same warp. And finally, the maximum number of threads per block is 1024.

A GTX 590 contains twice the numbers you mentioned, since there are two GPUs on the card. Below, I focus on a single chip.
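You can confirm the dual-GPU layout and the per-chip numbers at runtime. A minimal sketch using the standard device-query API (on a GTX 590 the runtime reports two devices, each with its own SM count):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);   // a GTX 590 shows up as 2 devices

    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
        printf("  multiprocessors (SMs): %d\n", prop.multiProcessorCount);
        printf("  warp size            : %d\n", prop.warpSize);
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}
```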

Let me state my understanding of the situation: for example, I allocate N blocks with the maximum threads-per-block size of 1024 on the GTX 590. As far as I understand from the CUDA programming guide and from other sources, the blocks are first enumerated by the hardware. In this case, 16 of the N blocks are assigned to different multiprocessors.

Blocks are not necessarily distributed evenly across the multiprocessors (SMs). If you schedule exactly 16 blocks, a few of the SMs can get 2 or 3 blocks while a few of them go idle. I don't know why.

Each block contains 1024 threads and the hardware scheduler assigns 32 of these threads to the 32 cores in a single multiprocessor.

The relationship between threads and cores is not that direct. There are 32 "basic" ALUs in each SM; these handle things such as single-precision floating point and most 32-bit integer and logic instructions. But there are only 16 load/store units, so if the warp instruction currently being processed is a load/store, it must be scheduled twice. And there are only 4 special function units, which do things such as trigonometry. So those instructions must be scheduled 32 / 4 = 8 times.
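To make the unit counts concrete, here is an illustrative kernel (the instruction mix is an assumption for the example) annotated with how each of one warp's instructions maps onto the execution units of a Fermi-class SM with 32 ALUs, 16 load/store units, and 4 SFUs:

```cuda
// Illustrative only: comments describe per-warp issue behavior on a
// Fermi-class SM (GTX 590 generation), as discussed in the text above.
__global__ void unitDemo(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float x = in[i];            // load: only 16 LD/ST units, so the 32-thread
                                // warp is scheduled in 2 passes
    float y = x * 2.0f + 1.0f;  // fused multiply-add on the 32 "basic" ALUs:
                                // the whole warp issues in 1 pass
    out[i] = __sinf(y);         // hardware sine on the 4 SFUs:
                                // 32 / 4 = 8 passes for the warp
}
```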

The threads in the same multiprocessor (warp) process the same line of the code and use shared memory of the current multiprocessor.

No, there can be many more than 32 threads "in flight" at the same time in a single SM.

If the current 32 threads encounter an off-chip operation like a memory read or write, they are replaced with another group of 32 threads from the current block. So, there are actually only 32 threads in a single block running exactly in parallel on a multiprocessor at any given time, not the whole 1024.

No, it is not only memory operations that cause warps to be replaced. The ALUs are also deeply pipelined, so new warps will be swapped in as data dependencies occur for values that are still in the pipeline. So, if the code contains two instructions where the second one uses the output from the first, the warp will be put on hold while the value from the first instruction makes its way through the pipeline.
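The two-instruction case described above can be sketched directly. These kernels are hypothetical examples, not from the original answer; the comments describe the scheduling behavior the text explains:

```cuda
// A data dependency that stalls a warp: the second statement needs the
// result of the first, so the warp waits while the multiply drains the
// ALU pipeline. The scheduler hides that latency by issuing other warps.
__global__ void dependent(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = in[i] * 2.0f;   // first instruction
    out[i] = a + 1.0f;        // uses 'a': must wait for the pipeline
}

// Independent instructions within one thread give the scheduler extra work
// to issue in the meantime (instruction-level parallelism).
__global__ void independent(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];
    float a = x * 2.0f;       // these two multiplies do not
    float b = x * 3.0f;       // depend on each other
    out[i] = a + b;           // only the final add must wait
}
```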

Finally, if a block is completely processed by a multiprocessor, a new thread block from the list of the N thread blocks is plugged into the current multiprocessor.

A multiprocessor can process more than one block at a time but a block cannot move to another MP once processing on it has started. The number of threads in a block that are currently in flight depends on how many resources the block uses. The CUDA Occupancy Calculator will tell you how many blocks will be in flight at the same time based on the resource usage of your specific kernel.
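Note that newer CUDA toolkits (6.5 and later, so after this answer was written) also expose the occupancy calculation as a runtime API. A sketch, assuming a kernel launched with 1024 threads per block and no dynamic shared memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel whose resource usage we want to query.
__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main()
{
    int blocksPerSM = 0;
    // Arguments: result, kernel, threads per block, dynamic shared mem bytes.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  1024, 0);
    printf("Resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```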

And finally, there are a total of 512 threads running in parallel in the GPU during the execution of the CUDA kernel. (I know that if a block uses more registers than are available on a single multiprocessor, then it is divided to work on two multiprocessors, but let's assume that each block can fit into a single multiprocessor in our case.)

No, a block cannot be divided to work on two multiprocessors. A whole block is always processed by a single multiprocessor. If the given multiprocessor does not have enough resources to process at least one block with your kernel, you will get a kernel launch error and your program won't run at all.
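You can observe this failure mode directly. A minimal sketch: requesting more threads per block than the device supports (2048 here, versus the 1024 maximum on this card) makes the launch fail immediately, and the error is visible via `cudaGetLastError`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty() {}

int main()
{
    empty<<<1, 2048>>>();   // exceeds maxThreadsPerBlock: no SM can run this
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```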

It depends on how you define a thread as "running". The GPU will typically have many more than 512 threads consuming various resources on the chip at the same time.

See @harrism's answer to this question: CUDA: How many concurrent threads in total?
