CUDA Blocks & Warps


Problem Description

OK, I know that related questions have been asked over and over again, and I have read pretty much everything I found about this, but things are still unclear. That is probably also because I found and read things that contradict each other (maybe because, being from different times, they refer to devices of different compute capabilities, between which there seems to be quite a gap). I want to be more efficient and reduce my execution time, so I need to know exactly how many threads/warps/blocks can run at once in parallel. I was also thinking of generalizing this and calculating an optimal number of threads and blocks to pass to my kernel based only on the number of operations I know I have to do (for simpler programs) and the system specs.

I have a GTX 550 Ti, by the way, with compute capability 2.1: 4 SMs x 48 cores = 192 CUDA cores.

OK, so what's unclear to me is:

Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)? I read that up to 8 blocks can be assigned to an SM, but nothing about how they are run. Since my maximum number of threads per SM (1536) is barely larger than my maximum number of threads per block (1024), I would think that blocks aren't run in parallel (maybe one and a half?), or at least not if I put the maximum number of threads in them. Also, if I set the number of blocks to, let's say, 4 (my number of SMs), will they each be sent to a different SM? Or can't I really control how all this is distributed on the hardware, in which case this is a moot point and my execution time will vary based on the whims of my device...

Secondly, I know that a block divides its threads into groups of 32 threads that run in parallel, called warps. Now, can these warps (presuming they have no relation to each other) run in parallel as well? Because in the Fermi architecture it states that 2 warps are executed concurrently, sending one instruction from each warp to a group of 16 (?) cores, while somewhere else I read that each core handles a warp, which would explain the 1536 max threads (32*48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?

On a simpler note, what I'm asking is: if, for example, I want to sum 2 vectors into a third one, what length should I give them (number of operations), and how should I split them into blocks and threads so that my device works concurrently (in parallel) at full capacity (without having idle cores or SMs)?

I'm sorry if this was asked before and I didn't get it or didn't see it. Hope you can help me. Thank you!

Solution

The distribution and parallel execution of work are determined by the launch configuration and the device. The launch configuration states the grid dimensions, block dimensions, registers per thread, and shared memory per block. Based upon this information and the device you can determine the number of blocks and warps that can execute on the device concurrently. When developing a kernel you usually look at the ratio of warps that can be active on the SM to the maximum number of warps per SM for the device. This is called the theoretical occupancy. The CUDA Occupancy Calculator can be used to investigate different launch configurations.
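As a concrete illustration of theoretical occupancy (a sketch added here, not part of the original answer): later CUDA toolkits expose a runtime query, cudaOccupancyMaxActiveBlocksPerMultiprocessor, that reports how many blocks of a given kernel can be resident per SM, which is the same number the Occupancy Calculator spreadsheet gives you. The kernel name myKernel and the 256-thread block below are assumptions made for the example.

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel, present only so the occupancy query has something to inspect.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    int blockSize = 256;      // candidate threads per block
    size_t dynamicSmem = 0;   // no dynamic shared memory requested

    // Ask the runtime how many blocks of this kernel can be resident on one SM
    // with this block size and shared-memory usage.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, dynamicSmem);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Theoretical occupancy: resident threads vs. the SM's maximum resident threads.
    float occupancy = (blocksPerSM * blockSize) /
                      (float)prop.maxThreadsPerMultiProcessor;
    printf("Resident blocks per SM: %d, theoretical occupancy: %.0f%%\n",
           blocksPerSM, occupancy * 100.0f);
    return 0;
}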

When a grid is launched, the compute work distributor rasterizes the grid and distributes thread blocks to SMs, and SM resources are allocated for each thread block. Multiple thread blocks can execute simultaneously on an SM if the SM has sufficient resources.

In order to launch a warp, the SM assigns the warp to a warp scheduler and allocates registers for the warp. At this point the warp is considered an active warp.

Each warp scheduler manages a set of warps (24 on Fermi, 16 on Kepler). Warps that are not stalled are called eligible warps. On each cycle the warp scheduler picks an eligible warp and issues instruction(s) for the warp to execution units such as int/fp units, double-precision floating-point units, special function units, branch resolution units, and load/store units. The execution units are pipelined, allowing many warps to have 1 or more instructions in flight each cycle. Warps can be stalled on instruction fetch, data dependencies, execution dependencies, barriers, etc.

Each kernel has a different optimal launch configuration. Tools such as Nsight Visual Studio Edition and the NVIDIA Visual Profiler can help you tune your launch configuration. I recommend that you try to write your code in a flexible manner so you can try multiple launch configurations. I would start by using a configuration that gives you at least 50% occupancy, then try increasing and decreasing the occupancy.
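One way to keep the code flexible in that sense (and it also covers the vector-sum case from the question) is a grid-stride loop, which makes the kernel correct for any grid and block size so the launch configuration can be swept freely. The following is a minimal sketch; vecAdd, the vector length, and the 256-thread block are illustrative choices, not values from the answer.

#include <cuda_runtime.h>

// Grid-stride loop: the kernel is correct for ANY grid/block configuration,
// so the launch parameters can be tuned independently of the problem size n.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        c[i] = a[i] + b[i];
    }
}

int main()
{
    const int n = 1 << 20;                  // vector length (illustrative)
    const size_t bytes = n * sizeof(float);

    float *a, *b, *c;                       // device buffers (host transfers omitted)
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);

    // A starting point to tune, not the single "right" answer: a block size that is
    // a multiple of the warp size (32), and enough blocks to cover n once each.
    int blockSize = 256;
    int gridSize  = (n + blockSize - 1) / blockSize;

    vecAdd<<<gridSize, blockSize>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}

With this pattern you can re-run the same kernel with different grid and block sizes and simply compare timings, which is exactly the kind of experiment the answer recommends.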

Answers to each Question

Q: Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)?

Yes, the maximum number is based upon the compute capability of the device. See Table 10, Technical Specifications per Compute Capability: Maximum number of resident blocks per multiprocessor, to determine the value. In general the launch configuration limits the run-time value. See the occupancy calculator or one of the NVIDIA analysis tools for more details.
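For reference, the limits that table documents can also be read from the device at run time with cudaGetDeviceProperties; the following is a minimal sketch (newer toolkits additionally expose cudaDeviceProp::maxBlocksPerMultiProcessor for the resident-block limit itself).

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // query device 0

    printf("Compute capability   : %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors (SMs): %d\n",    prop.multiProcessorCount);
    printf("Warp size            : %d\n",    prop.warpSize);
    printf("Max threads per block: %d\n",    prop.maxThreadsPerBlock);
    printf("Max threads per SM   : %d\n",    prop.maxThreadsPerMultiProcessor);
    return 0;
}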

Q: From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024), I would think that blocks aren't run in parallel (maybe one and a half?).

The launch configuration determines the number of blocks per SM. The ratio of maximum threads per block to maximum threads per SM is set to allow the developer more flexibility in how they partition work. For example (resources permitting), a single 1024-thread block fills only 1024 of the 1536 resident-thread slots on a CC 2.x SM, whereas three 512-thread blocks can fill all 1536.

Q: If I set the number of blocks to, let's say, 4 (my number of SMs), will they each be sent to a different SM? Or can't I really control how all this is distributed on the hardware, in which case this is a moot point and my execution time will vary based on the whims of my device...

You have limited control over work distribution. You can artificially control this by limiting occupancy, for example by allocating more shared memory, but this is an advanced optimization.
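A sketch of that advanced trick, using a hypothetical kernel named limitedKernel: passing a large dynamic shared-memory size as the third launch parameter makes each block consume more of the SM's shared memory, so fewer blocks can be resident at once. The sizes below are illustrative only.

#include <cuda_runtime.h>

// Illustrative kernel: the dynamically sized shared-memory buffer is used as
// scratch, but its real purpose here is to consume per-block SM resources.
__global__ void limitedKernel(float *data, int n)
{
    extern __shared__ float scratch[];      // size supplied at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        scratch[threadIdx.x] = data[i];
        data[i] = scratch[threadIdx.x] * 2.0f;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int blockSize = 256;
    int gridSize  = (n + blockSize - 1) / blockSize;

    // 24 KB of dynamic shared memory per block: with 48 KB of shared memory per
    // Fermi SM, at most 2 blocks can be resident per SM, regardless of other limits.
    size_t smemBytes = 24 * 1024;
    limitedKernel<<<gridSize, blockSize, smemBytes>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}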

Q: Secondly, I know that a block divides its threads into groups of 32 threads that run in parallel, called warps. Now, can these warps (presuming they have no relation to each other) run in parallel as well?

Yes, warps can run in parallel.

Q: Because in the Fermi architecture it states that 2 warps are executed concurrently

Each Fermi SM has 2 warp schedulers. Each warp scheduler can dispatch instruction(s) for 1 warp each cycle. Instruction execution is pipelined, so many warps can have 1 or more instructions in flight every cycle.

Q: Sending one instruction from each warp to a group of 16 (?) cores, while somewhere else I read that each core handles a warp, which would explain the 1536 max threads (32x48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?

Yes. The CUDA core count is the number of integer and floating-point execution units; the SM also has the other types of execution units listed above. The GTX 550 Ti is a CC 2.1 device. On each cycle an SM has the potential to dispatch at most 4 instructions (128 threads). Depending on the definition of execution, the total number of threads in flight per cycle can range from many hundreds to many thousands.
