CUDA: What is the threads per multiprocessor and threads per block distinction?

Problem Description

We have a workstation with two Nvidia Quadro FX 5800 cards installed. Running the deviceQuery CUDA sample reveals that the maximum threads per multiprocessor (SM) is 1024, while the maximum threads per block is 512.

Given that only one block can be executed on each SM at a time, why is max threads / processor double the max threads / block? How do we utilise the other 512 threads per SM?

Device 1: "Quadro FX 5800"
  CUDA Driver Version / Runtime Version          5.0 / 5.0
  CUDA Capability Major/Minor version number:    1.3
  Total amount of global memory:                 4096 MBytes (4294770688 bytes)
  (30) Multiprocessors x (  8) CUDA Cores/MP:    240 CUDA Cores
  GPU Clock rate:                                1296 MHz (1.30 GHz)
  Memory Clock rate:                             800 MHz
  Memory Bus Width:                              512-bit
  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Cheers, James.

Answer

Given that only one block can be executed on each SM at a time,

This statement is fundamentally incorrect. Barring resource conflicts, and assuming enough threadblocks in a kernel (i.e. the grid), an SM will generally have multiple threadblocks assigned to it.

The basic unit of execution is the warp. A warp consists of 32 threads, executed together in lockstep by an SM, on an instruction-cycle by instruction-cycle basis.

Therefore, even within a single threadblock, an SM will generally have more than a single warp "in flight". This is essential for good performance to allow the machine to hide latency.

There is no conceptual difference between choosing warps from the same threadblock to execute, or warps from different threadblocks. SMs can have multiple threadblocks resident on them (i.e. with resources such as registers and shared memory assigned to each resident threadblock), and the warp scheduler will choose from amongst all the warps in all the resident threadblocks, to select the next warp for execution on any given instruction cycle.

Therefore, the SM has a greater number of threads that can be "resident" because it can support more than a single block, even if that block is maximally configured with threads (512, in this case). We utilize more than the threadblock limit by having multiple threadblocks resident.

You may also want to research the idea of occupancy in GPU programs.
