CUDA - Multiprocessors, Warp size and Maximum Threads Per Block: What is the exact relationship?

Question

I know that there are multiprocessors on a CUDA GPU which contain CUDA cores in them. In my workplace I am working with a GTX 590, which contains 512 CUDA cores and 16 multiprocessors, and which has a warp size of 32. So this means there are 32 CUDA cores in each multiprocessor, which all work on exactly the same code in the same warp. And finally, the maximum number of threads per block is 1024.

My question is how the block size, the multiprocessor count, and the warp size are exactly related. Let me describe my understanding of the situation: For example, I allocate N blocks with the maximum threadsPerBlock size of 1024 on the GTX 590. As far as I understand from the CUDA programming guide and from other sources, the blocks are first enumerated by the hardware. In this case, 16 of the N blocks are assigned to different multiprocessors. Each block contains 1024 threads, and the hardware scheduler assigns 32 of these threads to the 32 cores in a single multiprocessor. The threads in the same multiprocessor (warp) process the same line of the code and use the shared memory of the current multiprocessor. If the current 32 threads encounter an off-chip operation like memory reads/writes, they are replaced with another group of 32 threads from the current block. So there are actually only 32 threads in a single block running exactly in parallel on a multiprocessor at any given time, not the whole 1024. Finally, once a block is completely processed by a multiprocessor, a new thread block from the list of N thread blocks is plugged into the current multiprocessor. And finally, there are a total of 512 threads running in parallel in the GPU during the execution of the CUDA kernel. (I know that if a block uses more registers than are available on a single multiprocessor then it is divided to work on two multiprocessors, but let's assume that each block can fit into a single multiprocessor in our case.)

So, is my model of CUDA parallel execution correct? If not, what is wrong or missing? I want to fine-tune the project I am currently working on, so I need the most accurate working model of the whole thing.

Answer

In my workplace I am working with a GTX 590, which contains 512 CUDA cores and 16 multiprocessors, and which has a warp size of 32. So this means there are 32 CUDA cores in each multiprocessor, which all work on exactly the same code in the same warp. And finally, the maximum number of threads per block is 1024.

A GTX 590 contains twice the numbers you mentioned, since there are 2 GPUs on the card. Below, I focus on a single chip.

Let me describe my understanding of the situation: For example, I allocate N blocks with the maximum threadsPerBlock size of 1024 on the GTX 590. As far as I understand from the CUDA programming guide and from other sources, the blocks are first enumerated by the hardware. In this case, 16 of the N blocks are assigned to different multiprocessors.

Blocks are not necessarily distributed evenly across the multiprocessors (SMs). If you schedule exactly 16 blocks, a few of the SMs can get 2 or 3 blocks while a few of them go idle. I don't know why.

Each block contains 1024 threads and the hardware scheduler assigns 32 of these threads to the 32 cores in a single multiprocessor.

The relationship between threads and cores is not that direct. There are 32 "basic" ALUs in each SM, which handle things such as single-precision floating point and most 32-bit integer and logic instructions. But there are only 16 load/store units, so if the warp instruction currently being processed is a load/store, it must be issued twice. And there are only 4 special function units, which handle things such as trigonometry, so those instructions must be issued 32 / 4 = 8 times.

The threads in the same multiprocessor (warp) process the same line of the code and use the shared memory of the current multiprocessor.

No, there can be many more than 32 threads "in flight" at the same time in a single SM.

If the current 32 threads encounter an off-chip operation like memory reads/writes, they are replaced with another group of 32 threads from the current block. So there are actually only 32 threads in a single block running exactly in parallel on a multiprocessor at any given time, not the whole 1024.

No, it is not only memory operations that cause warps to be replaced. The ALUs are also deeply pipelined, so new warps will be swapped in as data dependencies occur for values that are still in the pipeline. So, if the code contains two instructions where the second one uses the output from the first, the warp will be put on hold while the value from the first instruction makes its way through the pipeline.

Finally, if a block is completely processed by a multiprocessor, a new thread block from the list of the N thread blocks is plugged into the current multiprocessor.

A multiprocessor can process more than one block at a time but a block cannot move to another MP once processing on it has started. The number of threads in a block that are currently in flight depends on how many resources the block uses. The CUDA Occupancy Calculator will tell you how many blocks will be in flight at the same time based on the resource usage of your specific kernel.

And finally, there are a total of 512 threads running in parallel in the GPU during the execution of the CUDA kernel. (I know that if a block uses more registers than are available on a single multiprocessor then it is divided to work on two multiprocessors, but let's assume that each block can fit into a single multiprocessor in our case.)

No, a block cannot be divided to work on two multiprocessors. A whole block is always processed by a single multiprocessor. If the given multiprocessor does not have enough resources to process at least one block with your kernel, you will get a kernel launch error and your program won't run at all.

It depends on how you define a thread as "running". The GPU will typically have many more than 512 threads consuming various resources on the chip at the same time.

See @harrism's answer to this question: CUDA: How many concurrent threads in total?
