help me understand cuda


Question

I am having some trouble understanding threads in the NVIDIA GPU architecture with CUDA.

Please could anybody clarify this info: an 8800 GPU has 16 SMs with 8 SPs each, so we have 128 SPs.

I was viewing Stanford's video presentation, and it said that every SP is capable of running 96 threads concurrently. Does this mean that an SP can run 96/32 = 3 warps concurrently?

Moreover, since every SP can run 96 threads and we have 8 SPs in every SM, does this mean that every SM can run 96*8 = 768 threads concurrently? But if every SM can run only a single block at a time, and the maximum number of threads in a block is 512, what is the purpose of being able to run 768 threads concurrently when the maximum is 512 threads?

A more general question is: how are blocks, threads, and warps distributed to SMs and SPs? I read that every SM gets a single block to execute at a time, that the threads in a block are divided into warps (32 threads), and that the SPs execute warps.

Answer

You should check out the webinars on the NVIDIA website; you can join a live session or view the pre-recorded sessions. Below is a quick overview, but I strongly recommend you watch the webinars: they will really help, as you can see the diagrams and have them explained at the same time.

When you execute a function (a kernel) on a GPU, it executes as a grid of blocks of threads.

  • A thread is the finest granularity; each thread has a unique identifier within the block (threadIdx) which is used to select which data to operate on. A thread can have a relatively large number of registers and also has a private area of memory known as local memory, which is used for register-file spilling and any large automatic variables.
  • A block is a group of threads which execute together in a batch. The main reason for this level of granularity is that threads within a block can cooperate by communicating through the fast shared memory. Each block has a unique identifier (blockIdx) which, in conjunction with threadIdx, is used to select data (see the sketch after this list).
  • A grid is a set of blocks which together execute the GPU operation.
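
To make the hierarchy concrete, here is a minimal sketch (not from the original answer; the kernel name and the sizes are purely illustrative) showing how threadIdx, blockIdx, and blockDim combine into a per-thread index that selects the data each thread operates on:

    #include <cuda_runtime.h>

    // Each thread computes one element: blockIdx/threadIdx give every
    // thread a unique position within the grid of blocks of threads.
    __global__ void addOne(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard the last, partial block
            data[i] += 1.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        // The launch configuration is the grid described above:
        // 256 threads per block, enough blocks to cover all n elements.
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        addOne<<<blocks, threadsPerBlock>>>(d_data, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }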

That's the logical hierarchy. You really only need to understand the logical hierarchy to implement a function on the GPU; however, to get performance you also need to understand the hardware, which means SMs and SPs.

A GPU is composed of SMs, and each SM contains a number of SPs. Currently there are 8 SPs per SM and between 1 and 30 SMs per GPU, but really the actual number is not a major concern until you're getting really advanced.
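
If you want to see these numbers for your own card, a small sketch like the following (device 0 assumed; on recent toolkits the device properties also report the per-SM thread limit) queries them through the CUDA runtime:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Properties of device 0; the SM count and per-SM thread limit
        // vary from one GPU generation to the next.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("SMs: %d, max threads per SM: %d, max threads per block: %d\n",
               prop.multiProcessorCount,
               prop.maxThreadsPerMultiProcessor,
               prop.maxThreadsPerBlock);
        return 0;
    }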

The first point to consider for performance is that of warps. A warp is a set of 32 threads: if you have 128 threads in a block, for example, then threads 0-31 will be in one warp, threads 32-63 in the next, and so on. Warps are very important for a few reasons, the most important being:

  • Threads within a warp are bound together: if one thread within a warp goes down the 'if' side of an if-else block and the others go down the 'else', then in effect all 32 threads will go down both sides. Functionally there is no problem - the threads which should not have taken a branch are disabled, so you always get the correct result - but if both sides are long then the performance penalty is significant (see the divergence sketch after this list).
  • Threads within a warp (actually a half-warp, but if you get it right for warps then you're safe on the next generation too) fetch data from memory together, so if you can ensure that all threads fetch data within the same 'segment' then you pay only one memory transaction, whereas if they all fetch from random addresses then you pay 32 memory transactions. See the Advanced CUDA C presentation for details on this, but only when you are ready!
  • Threads within a warp (again, a half-warp on current GPUs) access shared memory together, and if you're not careful you will have 'bank conflicts' where the threads have to queue up behind each other to access the memory.
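
As an illustration of the first point, here are two hypothetical kernels (not part of the original answer): in the first, even and odd lanes of the same warp disagree on the branch, so each warp executes both sides; in the second, the condition is uniform across every group of 32 consecutive threads, so no warp diverges:

    #include <cuda_runtime.h>

    // Divergent: even and odd lanes of every warp take different branches,
    // so each warp runs the 'if' side and then the 'else' side.
    __global__ void divergentKernel(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (threadIdx.x % 2 == 0)
                out[i] = out[i] * 2.0f;   // half of each warp runs this...
            else
                out[i] = out[i] + 1.0f;   // ...then the other half runs this
        }
    }

    // Uniform: the condition is constant within each warp of 32 consecutive
    // threads, so every warp takes exactly one side of the branch.
    __global__ void uniformKernel(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if ((threadIdx.x / 32) % 2 == 0)
                out[i] = out[i] * 2.0f;
            else
                out[i] = out[i] + 1.0f;
        }
    }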

So, having understood what a warp is, the final point is how the blocks and the grid are mapped onto the GPU.

Each block will start on one SM and will remain there until it has completed. As soon as it has completed it will retire, and another block can be launched on that SM. It's this dynamic scheduling that gives GPUs their scalability: if you have one SM then all blocks run on the same SM in one big queue; if you have 30 SMs then the blocks will be scheduled across the SMs dynamically. So you should ensure that when you launch a GPU function, your grid is composed of a large number of blocks (at least hundreds) to ensure it scales across any GPU.

The final point to make is that an SM can execute more than one block at any given time. This explains why an SM can handle 768 threads (or more on some GPUs) while a block is only up to 512 threads (currently). Essentially, if the SM has the resources available (registers and shared memory) then it will take on additional blocks (up to 8). The Occupancy Calculator spreadsheet (included with the SDK) will help you determine how many blocks can execute at any moment.
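
As a side note for readers on newer toolkits (this is not mentioned in the original answer): CUDA 6.5 and later also expose this calculation programmatically via cudaOccupancyMaxActiveBlocksPerMultiprocessor. A minimal sketch, where the kernel is just a placeholder whose register and shared-memory usage drives the result:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel; its resource usage determines the occupancy figure.
    __global__ void myKernel(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }

    int main()
    {
        // How many blocks of 256 threads can be resident on one SM at once,
        // given this kernel's register and shared-memory usage?
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
        printf("Resident blocks per SM at 256 threads/block: %d\n", blocksPerSM);
        return 0;
    }

On the hardware discussed in the question, 768 threads per SM with 256-thread blocks would mean up to three resident blocks per SM, resources permitting.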

Sorry for the brain dump; watch the webinars - it'll be easier!
