Streaming multiprocessors and cores per streaming multiprocessor in CUDA


Question



Different NVIDIA graphics cards have different specifications: a different number of streaming multiprocessors (SMs), and a different number of cores in each SM.

Thread blocks are assigned to a single SM according to the capacity of the device, for example 1 block of 32 warps or 2 blocks of 16 warps.

But I don't understand the role of the number of cores in each SM. What is the significance of a device having a larger number of cores in each SM?

I suppose we need to make better use of the device properties for better optimization.

How does a CUDA program actually flow through the device with respect to SMs and cores per SM?

Answer

What is the significance of a device having a larger number of cores in each SM?

The number of cores per SM translates roughly to how many warp instructions can be processed in any given clock cycle. A single warp instruction can be processed in any given clock cycle, but it requires 32 cores to complete (and may require multiple clock cycles to complete, depending on the instruction). A cc 2.0 Fermi SM with 32 "cores" can retire at most 1 instruction per clock on average (it's actually 2 instructions every 2 clocks). A Kepler SMX with 192 cores can retire 4 or more instructions per clock. For a more precise answer, refer to the compute capabilities architecture section of the programming guide, and note that there is one section for each compute capability (1.0, 2.0, 3.0).
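The arithmetic behind that estimate can be sketched in plain host-side C++. This is only a rough upper bound under one assumption: every warp instruction occupies 32 cores for one clock, and warp scheduler and issue limits are ignored. The names `kWarpSize`, `warpInstructionsPerClock`, and `report` are made up for this illustration and are not part of any CUDA API.

```cpp
#include <cassert>
#include <cstdio>

// Threads per warp on the NVIDIA GPUs discussed here.
constexpr int kWarpSize = 32;

// Rough upper bound on warp-wide "core" instructions per clock for one SM.
// Assumption (not an architectural guarantee): each warp instruction
// occupies kWarpSize cores for exactly one clock, and scheduler/issue
// limits are ignored.
inline int warpInstructionsPerClock(int coresPerSM) {
    assert(coresPerSM >= 0);
    return coresPerSM / kWarpSize;
}

// Prints the estimate for one named SM configuration.
inline void report(const char* name, int coresPerSM) {
    std::printf("%s: %d cores/SM -> about %d warp instruction(s) per clock\n",
                name, coresPerSM, warpInstructionsPerClock(coresPerSM));
}
```

Calling `report("Fermi cc 2.0 SM", 32)` and `report("Kepler SMX", 192)` gives estimates of 1 and 6 respectively. The Kepler figure is a ceiling from core count alone, which is consistent with "4 or more" once real scheduling limits are taken into account.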

How does a CUDA program actually flow through the device with respect to SMs and cores per SM?

This question has been answered many times on the CUDA tag. Each threadblock in the grid associated with a kernel launch is assigned to one SM (when the SM has a free slot). The SM then "unpacks" the threadblock into warps, and schedules warp instructions on the SM internal resources (e.g. "cores", and special function units), as those resources become available.
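The "unpacking" step can be illustrated with a small host-side C++ sketch that counts the warps in a thread block and estimates how many such blocks fit on one SM. It assumes the number of resident warps is the only limit. Real devices also cap resident blocks, registers, and shared memory per SM, so treat the result as an upper bound. The function names and the `maxWarpsPerSM` parameter are hypothetical, and the value 32 used below comes from the question's example rather than from any particular device.

```cpp
#include <cassert>

// Threads per warp on the NVIDIA GPUs discussed here.
constexpr int kWarpSize = 32;

// Number of warps a thread block is split into.
// A partially filled last warp still counts as a whole warp.
inline int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Upper bound on thread blocks resident on one SM, assuming the resident
// warp count is the ONLY limit. maxWarpsPerSM is a hypothetical parameter
// supplied by the caller; look up the real value for your device.
inline int residentBlocksPerSM(int threadsPerBlock, int maxWarpsPerSM) {
    assert(maxWarpsPerSM > 0);
    return maxWarpsPerSM / warpsPerBlock(threadsPerBlock);
}
```

With the question's figure of 32 warps per SM, `residentBlocksPerSM(1024, 32)` yields 1 block of 32 warps and `residentBlocksPerSM(512, 32)` yields 2 blocks of 16 warps, matching the example in the question.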
