What's the mechanism of the warps and the banks in CUDA?


Problem description


    I'm new to CUDA parallel programming, and I'm confused about global memory access on the device, specifically the warp model and coalescing.

    My understanding so far:

    1. Threads in a block are split into warps of at most 32 threads each, and all threads of a warp execute simultaneously on the same processor. If so, what is the point of a half-warp?

    2. The shared memory of a block is split into 16 banks. To avoid bank conflicts, multiple threads may READ one bank at the same time, but may not write to the same bank. Is this interpretation correct?

    Thanks in advance!

    Solution

    1. The term "half-warp" applied mainly to CUDA processors prior to the Fermi generation (e.g. the "Tesla" or GT200 generation, and the original G80/G92 generation). These GPUs were architected with an SM (streaming multiprocessor, a HW block inside the GPU) that had fewer than 32 thread processors. The definition of a warp was still the same, but the actual HW execution took place one "half-warp" at a time. The granular details are more complicated than this, but suffice it to say that the execution model caused memory requests to be issued according to the needs of a half-warp, i.e. 16 threads within the warp. A full warp that hit a memory transaction would thus generate a total of 2 requests for that transaction.

      Fermi and newer GPUs have at least 32 thread processors per SM. Therefore a memory transaction is immediately visible across a full warp. As a result, memory requests are issued at the per-warp level, rather than per-half-warp. However, a full memory request can only retrieve 128 bytes at a time, which 32 threads reading 4 bytes each already fill. Therefore, for data sizes larger than 32 bits per thread per transaction, the memory controller may still break the request down into half-warp-sized pieces.

      My view is that, especially for a beginner, it's not necessary to have a detailed understanding of half-warps. It's generally sufficient to understand that the term refers to a group of 16 threads executing together and that it has implications for memory requests.

    2. Shared memory on Fermi-class GPUs, for example, is broken into 32 banks. On previous GPUs it was broken into 16 banks. A bank conflict occurs any time an individual bank is accessed by more than one thread in the same memory request (i.e. originating from the same code instruction). To avoid bank conflicts, the basic strategies are very similar to the strategies for coalescing memory requests, e.g. for global memory. On Fermi and newer GPUs, multiple threads can read the same address without causing a bank conflict, but in general a bank conflict arises when multiple threads access different addresses in the same bank. For further understanding of shared memory and how to avoid bank conflicts, I would recommend the NVIDIA webinar on this topic.
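To make the arithmetic in point 1 concrete, here is a minimal sketch. It assumes a fully coalesced, 128-byte-aligned access by a full warp and takes the 128-byte request size from the answer above. The names `requestsPerWarp` and `copyKernel` are hypothetical, introduced only for illustration; real hardware behaviour varies by architecture and cache configuration.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;       // threads per warp
constexpr int kRequestBytes = 128;  // request size quoted in the answer above

// Hypothetical helper: how many 128-byte requests a full warp needs when
// each thread reads bytesPerThread contiguous bytes (assumes aligned,
// perfectly coalesced access).
constexpr int requestsPerWarp(int bytesPerThread) {
    return (kWarpSize * bytesPerThread + kRequestBytes - 1) / kRequestBytes;
}

// 4 bytes per thread: 32 * 4 = 128 bytes, one request for the whole warp.
static_assert(requestsPerWarp(sizeof(float)) == 1, "float: one request");
// 8 bytes per thread: 256 bytes, two requests, i.e. one per half-warp.
static_assert(requestsPerWarp(sizeof(double)) == 2, "double: two requests");
// 16 bytes per thread: 512 bytes, four requests, i.e. one per quarter-warp.
static_assert(requestsPerWarp(sizeof(float4)) == 4, "float4: four requests");

// Coalesced copy: consecutive threads touch consecutive elements, which is
// the access pattern the model above describes.
template <typename T>
__global__ void copyKernel(const T* in, T* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}
```

With device buffers `dIn` and `dOut` of `n` elements, a launch such as `copyKernel<double><<<(n + 255) / 256, 256>>>(dIn, dOut, n)` would exercise the two-requests-per-warp case.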
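For point 2, the sketch below models how word addresses map to banks and shows the common padding trick for avoiding conflicts. It assumes 32 banks of 4-byte words (the Fermi-class layout mentioned above) and a non-negative stride. `conflictDegree` and `transposeTile` are hypothetical names for illustration, not part of any NVIDIA API.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;
constexpr int kNumBanks = 32;  // assumption: Fermi-class or newer, 4-byte banks
constexpr int kTile = 32;

// Host-side model: thread t of a warp accesses word t * strideWords.
// Returns the largest number of distinct words requested from one bank;
// 1 means conflict-free, 32 means the access is fully serialized.
inline int conflictDegree(int strideWords) {
    assert(strideWords >= 0);
    if (strideWords == 0) {
        return 1;  // every thread reads the same address: broadcast, no conflict
    }
    int perBank[kNumBanks] = {0};
    int worst = 0;
    for (int t = 0; t < kWarpSize; ++t) {
        int bank = (t * strideWords) % kNumBanks;
        if (++perBank[bank] > worst) {
            worst = perBank[bank];
        }
    }
    return worst;
}

// Transposes one kTile x kTile tile through shared memory.
// Expects a single block of dim3(kTile, kTile) threads.
__global__ void transposeTile(const float* in, float* out) {
    // The +1 padding makes the row pitch 33 words, so the column-wise read
    // below uses stride 33 (conflict-free in the model) instead of
    // stride 32 (a 32-way conflict).
    __shared__ float tile[kTile][kTile + 1];
    int x = threadIdx.x;
    int y = threadIdx.y;
    tile[y][x] = in[y * kTile + x];   // row-wise write: stride 1
    __syncthreads();
    out[y * kTile + x] = tile[x][y];  // column-wise read: stride kTile + 1
}
```

In this model `conflictDegree(32)` is 32 and `conflictDegree(33)` is 1, which is why the padded tile avoids the conflict. A launch would look like `transposeTile<<<1, dim3(kTile, kTile)>>>(dIn, dOut)` with device buffers of `kTile * kTile` floats.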
