CUDA: How many concurrent threads in total?


Question




I have a GeForce GTX 580, and I want to make a statement about the total number of threads that can (ideally) actually run in parallel on it, to compare with 2- or 4-core multi-core CPUs.

deviceQuery gives me the following possibly relevant information:

CUDA Capability Major/Minor version number: 2.0

(16) Multiprocessors x (32) CUDA Cores/MP: 512 CUDA Cores

Maximum number of threads per block: 1024

I think I heard that each CUDA core can run a warp in parallel, and that a warp is 32 threads. Would it be correct to say that the card can run 512*32 = 16384 threads in parallel then, or am I way off and the CUDA cores are somehow not really running in parallel?

Solution

The GTX 580 can have 16 * 48 concurrent warps (32 threads each) running at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads.
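The arithmetic behind that figure can be checked with a short sketch. The limits used here (16 SMs, 48 resident warps per SM, 32 threads per warp) are the Fermi / compute-capability-2.0 numbers quoted in the answer:

```python
# Maximum resident ("concurrent") threads on a GTX 580 (Fermi, CC 2.0).
SMS = 16               # multiprocessors reported by deviceQuery
WARPS_PER_SM = 48      # max resident warps per SM at compute capability 2.0
THREADS_PER_WARP = 32  # warp size

max_resident_threads = SMS * WARPS_PER_SM * THREADS_PER_WARP
print(max_resident_threads)  # 24576
```

Note this is the maximum *resident* thread count, which is reached only at 100% occupancy; register and shared-memory usage per block can lower it in practice.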

Don't confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored on-chip simultaneously -- the number that can be resident. In CUDA terms we also call this maximum occupancy. The hardware switches between warps constantly to help cover or "hide" the (large) latency of memory accesses as well as the (small) latency of arithmetic pipelines.

While each SM can have 48 resident warps, it can only issue instructions from a small number (on average between 1 and 2 for GTX 580, but it depends on the program instruction mix) of warps at each clock cycle.

So you are probably better off comparing throughput, which is determined by the available execution units and how the hardware is capable of performing multi-issue. On GTX580, there are 512 FMA execution units, but also integer units, special function units, memory instruction units, etc, which can be dual-issued to (i.e. issue independent instructions from 2 warps simultaneously) in various combinations.

Taking into account all of the above is too difficult, though, so most people compare on two metrics:

  1. Peak GFLOP/s (which for GTX 580 is 512 FMA units * 2 flops per FMA * 1544e6 cycles/second = 1581.1 GFLOP/s (single precision))
  2. Measured throughput on the application you are interested in.
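The peak-GFLOP/s figure in item 1 is just units × flops-per-FMA × clock rate; as a quick sketch (using the 1544 MHz shader clock from the answer):

```python
# Single-precision peak throughput of a GTX 580, per the figures above.
FMA_UNITS = 512           # CUDA cores, each able to issue one FMA per cycle
FLOPS_PER_FMA = 2         # a fused multiply-add counts as 2 floating-point ops
SHADER_CLOCK_HZ = 1544e6  # GTX 580 shader clock

peak_gflops = FMA_UNITS * FLOPS_PER_FMA * SHADER_CLOCK_HZ / 1e9
print(round(peak_gflops, 1))  # 1581.1
```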

The most important comparison is always measured wall-clock time on a real application.
