How do CUDA blocks/warps/threads map onto CUDA cores?


Problem Description

I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/threads. I am studying the architecture from a didactic point of view (a university project), so reaching peak performance is not my concern.

First of all, I would like to understand if I got these facts straight:

  1. The programmer writes a kernel and organizes its execution in a grid of thread blocks.

  2. Each block is assigned to a Streaming Multiprocessor (SM). Once assigned, it cannot migrate to another SM.

  3. Each SM splits its own blocks into Warps (currently with a maximum size of 32 threads). All the threads in a warp execute concurrently on the resources of the SM.

  4. The actual execution of a thread is performed by the CUDA Cores contained in the SM. There is no specific mapping between threads and cores.

  5. If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run.

  6. On the other hand, if a block contains 48 threads, it will be split into 2 warps and they will execute in parallel provided that enough memory is available.

  7. If a thread starts on a core and is then stalled waiting for a memory access or a long floating-point operation, its execution could resume on a different core.

Are they correct?

Now, I have a GeForce 560 Ti, so according to the specifications it is equipped with 8 SMs, each containing 48 CUDA cores (384 cores in total).

My goal is to make sure that every core of the architecture executes the SAME instructions. Assuming that my code will not require more registers than the ones available in each SM, I imagined different approaches:

  1. I create 8 blocks of 48 threads each, so that each SM has 1 block to execute. In this case, will the 48 threads execute in parallel in the SM (exploiting all 48 cores available to them)?

  2. Is there any difference if I launch 64 blocks of 6 threads? (Assuming that they will be mapped evenly among the SMs)

  3. If I "submerge" the CPU with work (creating 1024 blocks of 1024 thread each, for example) is it reasonable to assume that all the cores in the architecture will be used at a certain point, and will perform the same computations (assuming that the threads never stall)?

  4. Is there any way to check these situations using the profiler?

  5. Is there any reference for this stuff? I read the CUDA Programming Guide and the chapters dedicated to hardware architecture in "Programming Massively Parallel Processors" and "CUDA Application Design and Development", but I could not get a precise answer.

Solution

Two of the best references are

  1. NVIDIA Fermi Compute Architecture Whitepaper
  2. GF104 Reviews

I'll try to answer each of your questions.

The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to an SM, the resources for the thread block are allocated (warps and shared memory) and its threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch the warps to execution units. For more details on execution units and instruction dispatch see reference 1, pp. 7-10, and reference 2.
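
As an illustration of that hierarchy, here is a minimal sketch of a 1-D launch; the kernel name and launch configuration are invented for this example, and warpSize is the CUDA built-in variable (32 on current hardware):

    #include <cstdio>

    // Toy kernel: each warp reports which block it belongs to, its warp index
    // inside the block, and the range of threads it contains.
    __global__ void whoAmI()
    {
        int warpInBlock = threadIdx.x / warpSize;   // which warp of this block
        int lane        = threadIdx.x % warpSize;   // laneid: position inside the warp
        if (lane == 0)                              // one printout per warp
            printf("block %d, warp %d (threads %d-%d)\n",
                   blockIdx.x, warpInBlock,
                   warpInBlock * warpSize,
                   min((warpInBlock + 1) * warpSize, (int)blockDim.x) - 1);
    }

    int main()
    {
        whoAmI<<<2, 80>>>();        // a grid of 2 blocks, 80 threads per block
        cudaDeviceSynchronize();    // device printf requires compute capability 2.0+
        return 0;
    }

With 80 threads per block, each block produces two full warps and one partial warp of 16 active lanes, which ties into points 5' and 6' below.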

4'. There is a mapping between laneid (a thread's index within its warp) and a core.

5'. If a warp contains fewer than 32 threads it will in most cases be executed the same as if it had 32 threads. Warps can have fewer than 32 active threads for several reasons: the number of threads per block is not divisible by 32, the program executes a divergent branch so threads that did not take the current path are marked inactive, or a thread in the warp has exited.

6'. A thread block will be divided into WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize warps. There is no requirement for the warp schedulers to select two warps from the same thread block.
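
That split is just a ceiling division; a standalone host-side sketch (the helper name is made up):

    #include <cstdio>

    // Ceiling division: how many warps a thread block of a given size occupies.
    int warpsPerBlock(int threadsPerBlock, int warpSize = 32)
    {
        return (threadsPerBlock + warpSize - 1) / warpSize;
    }

    int main()
    {
        printf("48 threads   -> %d warps per block\n", warpsPerBlock(48));   // 2 (one of 32, one of 16)
        printf("6 threads    -> %d warps per block\n", warpsPerBlock(6));    // 1 (only 6 active lanes)
        printf("1024 threads -> %d warps per block\n", warpsPerBlock(1024)); // 32
        return 0;
    }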

7'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched, the instruction will be dispatched again in the future when the resource is available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, etc. A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp scheduler can issue an instruction.

See reference 2 for differences between a GTX480 and GTX560.

If you read the reference material (a few minutes) I think you will find that your goal does not make sense. I'll try to respond to your points.

1'. If you launch kernel<<<8, 48>>> you will get 8 blocks, each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 blocks will be assigned to different SMs. If 2 warps are allocated to an SM, then it is possible for each warp scheduler to select a warp and execute it. You will only use 32 of the 48 cores.

2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.

  • 8 blocks with 48 threads = 16 warps; 16 warps * 10 instructions = 160 warp instructions
  • 64 blocks with 6 threads = 64 warps; 64 warps * 10 instructions = 640 warp instructions

In order to get optimal efficiency, the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.
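
A common way to follow that advice in practice is to pick a block size that is a multiple of 32 and guard the tail of the data with a bounds check. A generic sketch (the kernel and sizes are invented for illustration):

    #include <cuda_runtime.h>

    // Each thread scales one element; threads past the end of the array simply
    // exit, so the launch can always be padded up to full warps.
    __global__ void scale(float *data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // bounds guard for the padded launch
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1000;                              // deliberately not a multiple of 32
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        int blockSize = 256;                             // 8 full warps per block
        int gridSize  = (n + blockSize - 1) / blockSize; // ceiling division -> 4 blocks
        scale<<<gridSize, blockSize>>>(d_data, n, 2.0f);

        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }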

3'. A GTX560 can have 8 SMs * 8 blocks = 64 blocks at a time, or 8 SMs * 48 warps = 512 warps, if the kernel does not max out registers or shared memory. At any given time only a portion of the work will be active on the SMs. Each SM has multiple execution units (more than just CUDA cores). Which resources are in use at any given time depends on the warp schedulers and the instruction mix of the application. If you don't do TEX operations then the TEX units will be idle. If you don't do special floating-point operations the SFU (special function) units will be idle.
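
If you want to check these per-kernel limits yourself, CUDA toolkits from 6.5 onward (later than the Fermi-era tools this question was written against) expose an occupancy query; a hedged sketch, with an invented kernel just to have something to measure:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummyKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = 2.0f * i;     // trivial body, only here to be queried
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize   = 192;      // 6 warps per block
        int blocksPerSM = 0;
        // Ask the runtime how many blocks of this kernel fit on one SM, given the
        // kernel's register/shared-memory usage and the chosen block size.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel,
                                                      blockSize, 0);

        printf("%d SMs, %d resident blocks of %d threads per SM -> %d resident warps per SM\n",
               prop.multiProcessorCount, blocksPerSM, blockSize,
               blocksPerSM * blockSize / 32);
        return 0;
    }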

4'. Parallel Nsight and the Visual Profiler show

a. executed IPC

b. issued IPC

c. active warps per active cycle

d. eligible warps per active cycle (Nsight only)

e. warp stall reasons (Nsight only)

f. active threads per instruction executed

The profilers do not show the utilization percentage of any of the execution units. For a GTX560 a rough estimate would be IssuedIPC / MaxIPC. For MaxIPC, assume 2 for GF100 (GTX480) and 4 for GF10x (GTX560), although 3 is a more realistic target.
