CUDA cores vs thread count

Views: 1968 | Published: 2016/5/28 11:13:27 | Tags: architecture, cuda, hardware


Question

I am confused by the relationship between the number of cores in an NVIDIA GPU, the number of streaming multiprocessors (SMPs), and the maximum thread count. The device properties for my laptop's GT 650M show 384 cores and 2 SMPs, with 1024 threads per SMP.

How are these numbers related to each other and to the warp size? I assume (perhaps incorrectly) that there are 192 cores per SMP, but 192 is not a factor of 1024. If each core runs a warp of 32 threads, I would expect 32 * 192 threads per SMP, or 2^5 * (2^7 + 2^6), or 4096 + 2048 = 6144.

What am I missing?

Recommended answer

I think you should take a closer look at how kernels are scheduled in CUDA.

There are two important sizes: the number of blocks and the number of threads per block.

Each block is scheduled on one SM, where it is sliced into warps. This is why a block has shared memory that is accessible only from inside the block: that memory lives on the SM. The number of blocks per SM depends on the device limit and on the occupancy calculation. The maximum number of blocks per SM is 8 for CC 1.0-2.x and 16 for CC 3.x.

Each block has a certain number of threads. The threads are divided into warps, and the warps can run in an arbitrary order that is determined only by the warp scheduler on the SM.
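As a rough illustration of how a block is sliced into warps, the sketch below computes the warp count for a given block size. It is plain Python arithmetic rather than a CUDA API call, the function name `warps_per_block` is made up for this example, and it assumes the warp size of 32 threads mentioned in the question.

```python
WARP_SIZE = 32  # threads per warp, as stated in the question


def warps_per_block(threads_per_block: int, warp_size: int = WARP_SIZE) -> int:
    """Return how many warps one block is split into.

    A partially filled last warp still occupies a whole warp,
    so the division is rounded up.
    """
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    return (threads_per_block + warp_size - 1) // warp_size


if __name__ == "__main__":
    print(warps_per_block(1024))  # 32 full warps
    print(warps_per_block(100))   # 4 warps, the last one only partly filled
```

Block sizes that are a multiple of 32 avoid the partly filled last warp.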

Your card has a total of 384 cores on 2 SMs, with 192 cores each. The CUDA core count represents the total number of single-precision floating-point or integer thread instructions that can be executed per cycle. Do not use the CUDA core count in any thread or block calculation.

The maximum number of threads depends on the compute capability. CC 2.0-3.x supports at most 1024 threads per block, given sufficient registers and warp slots. Warps are statically assigned to warp schedulers. The number of warp schedulers per SM is 1 for CC 1.x, 2 for CC 2.x, and 4 for CC 3.x.
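To make these limits concrete, here is a minimal sketch that estimates how many blocks can be resident on one SM of a CC 3.x device. The 1024 threads-per-block and 16 blocks-per-SM limits come from this answer; the limit of 64 resident warps per SM is an assumption that the answer does not state, and the function name is hypothetical. Register and shared-memory pressure, which a real occupancy calculation also considers, are ignored.

```python
WARP_SIZE = 32                # threads per warp
MAX_THREADS_PER_BLOCK = 1024  # CC 2.0-3.x limit quoted in the answer
MAX_BLOCKS_PER_SM = 16        # CC 3.x limit quoted in the answer
MAX_WARPS_PER_SM = 64         # ASSUMPTION: CC 3.x resident-warp limit, not from the answer


def resident_blocks_per_sm(threads_per_block: int) -> int:
    """Upper bound on resident blocks per SM, ignoring registers and shared memory."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 1024")
    # Round up: a partly filled warp still takes a full warp slot.
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    # The tighter of the two limits (block slots, warp slots) wins.
    return min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps)


if __name__ == "__main__":
    for size in (1024, 256, 128, 32):
        print(size, resident_blocks_per_sm(size))
```

Under these assumptions, large blocks are limited by warp slots and small blocks by the blocks-per-SM limit.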

If your application does not execute concurrent kernels, then to use every SM the grid (gridDim) should contain at least as many blocks as there are SMs.

To fully use the compute power of the GT 650M, you should launch at least two blocks (with a single block you could only use one SM). On the other hand, if you want to schedule 10240 threads, you can simply launch 10 blocks of 1024 threads each.
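The block-count arithmetic above can be sketched as follows. This is plain Python rather than CUDA host code, both helper names are invented for this example, and the default of 2 SMs simply mirrors the GT 650M from the question.

```python
def blocks_needed(total_threads: int, threads_per_block: int = 1024) -> int:
    """Number of blocks required to cover total_threads, rounding up."""
    if total_threads <= 0 or threads_per_block <= 0:
        raise ValueError("both arguments must be positive")
    return (total_threads + threads_per_block - 1) // threads_per_block


def uses_every_sm(num_blocks: int, num_sms: int = 2) -> bool:
    """True if there is at least one block per SM (no concurrent kernels assumed)."""
    return num_blocks >= num_sms


if __name__ == "__main__":
    blocks = blocks_needed(10240)         # 10 blocks of 1024 threads
    print(blocks, uses_every_sm(blocks))  # 10 True
    print(uses_every_sm(1))               # False: one block keeps only one SM busy
```

When the thread count is not a multiple of the block size, the rounded-up grid contains surplus threads, so a kernel would normally guard its work with a bounds check on the global thread index.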
