How can concurrent blocks run on a single GPU streaming multiprocessor?


Problem Description

I have been studying the CUDA programming model, and my impression after studying it is this: after the blocks and threads are created, each block is assigned to one streaming multiprocessor (e.g. I am using a GeForce 560 Ti, which has 14 streaming multiprocessors, so at one time 14 blocks can be assigned across all the streaming multiprocessors). But as I go through several online materials, such as this one:

http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/GPU+CUDA.pdf

it is mentioned there that several blocks can run concurrently on one multiprocessor. I am very confused about how threads and blocks execute on the streaming multiprocessors. I know that the assignment of blocks and the execution of threads are completely arbitrary, but I would like to know how the mapping of blocks and threads actually happens, so that concurrent execution can occur.
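To make the setup concrete, the scenario being asked about is an ordinary launch whose grid contains far more blocks than the device has SMs. A minimal sketch (the kernel, array size, and block size below are illustrative, not taken from the question):

    #include <cuda_runtime.h>

    // Illustrative kernel: each thread doubles one element.
    __global__ void scale(float *v)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        v[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;                 // 1M elements
        float *d_v;
        cudaMalloc(&d_v, n * sizeof(float));

        // 4096 blocks of 256 threads each: far more blocks than the
        // 14 SMs mentioned above. The hardware distributes blocks
        // over the SMs in an unspecified order, and one SM may hold
        // several resident blocks at once if resources permit.
        scale<<<n / 256, 256>>>(d_v);

        cudaDeviceSynchronize();
        cudaFree(d_v);
        return 0;
    }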

Solution

A streaming multiprocessor (SM) can execute more than one block at a time using hardware multithreading, a process akin to Hyper-Threading.

The CUDA C Programming Guide describes this as follows in Section 4.2:

4.2 Hardware Multithreading

The execution context (program counters, registers, etc.) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.

In particular, each multiprocessor has a set of 32-bit registers that are partitioned among the warps, and a parallel data cache or shared memory that is partitioned among the thread blocks.

The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits, as well as the amount of registers and shared memory available on the multiprocessor, are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
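As a practical aside (this API postdates the original question; it appeared in CUDA 6.5), the runtime can report how many blocks of a given kernel fit on one SM at a time. A minimal sketch, reusing the illustrative scale kernel and the block size of 256 assumed above:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Same illustrative kernel as in the question sketch.
    __global__ void scale(float *v)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        v[i] *= 2.0f;
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blocksPerSM = 0;
        // How many blocks of scale() (256 threads, no dynamic shared
        // memory) can be resident on one SM at once, given the SM's
        // register and shared-memory budget.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scale, 256, 0);

        printf("SMs: %d, resident blocks per SM: %d, GPU-wide: %d\n",
               prop.multiProcessorCount, blocksPerSM,
               blocksPerSM * prop.multiProcessorCount);
        return 0;
    }

The same resource accounting can be done by hand: if, for example, a kernel needs 16 KB of shared memory per block and the SM offers 48 KB, at most three of its blocks can be resident on that SM from that constraint alone, whichever other limits also apply.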



