Help me understand CUDA


Question

I am having some trouble understanding threads in the NVIDIA GPU architecture with CUDA.

Please could anybody clarify this: an 8800 GPU has 16 SMs with 8 SPs each, so we have 128 SPs.

I was viewing Stanford's video presentation and it said that every SP is capable of running 96 threads concurrently. Does this mean that an SP can run 96/32 = 3 warps concurrently?

Moreover, since every SP can run 96 threads and we have 8 SPs in every SM, does this mean that every SM can run 96*8 = 768 threads concurrently? But if every SM can run a single block at a time, and the maximum number of threads in a block is 512, then what is the purpose of being able to run 768 threads concurrently when a block has at most 512 threads?

A more general question is: how are blocks, threads, and warps distributed to SMs and SPs? I read that every SM gets a single block to execute at a time, that the threads in a block are divided into warps (32 threads each), and that SPs execute warps.

Answer

You should check out the webinars on the NVIDIA website; you can join a live session or view the pre-recorded sessions. Below is a quick overview, but I strongly recommend you watch the webinars. They will really help, as you can see the diagrams and have them explained at the same time.

When you execute a function (a kernel) on a GPU, it executes as a grid of blocks of threads.


  • A thread is the finest granularity, each thread has a unique identifier within the block (threadIdx) which is used to select which data to operate on. The thread can have a relatively large number of registers and also has a private area of memory known as local memory which is used for register file spilling and any large automatic variables.
  • A block is a group of threads which execute together in a batch. The main reason for this level of granularity is that threads within a block can cooperate by communicating using the fast shared memory. Each block has a unique identifier (blockIdx) which, in conjunction with the threadIdx, is used to select data.
  • A grid is a set of blocks which together execute the GPU operation.

That's the logical hierarchy. You really only need to understand the logical hierarchy to implement a function on the GPU; however, to get performance you also need to understand the hardware, which is the SMs and SPs.

A GPU is composed of SMs, and each SM contains a number of SPs. Currently there are 8 SPs per SM and between 1 and 30 SMs per GPU, but the actual number is not a major concern until you're getting really advanced.

The first point to consider for performance is that of warps. A warp is a set of 32 threads (if you have 128 threads in a block, for example, then threads 0-31 will be in one warp, 32-63 in the next, and so on). Warps are very important for a few reasons, the most important being:


  • Threads within a warp are bound together: if one thread within a warp goes down the 'if' side of an if-else block and the others go down the 'else', then in effect all 32 threads will go down both sides. Functionally there is no problem, since those threads which should not have taken the branch are disabled and you will always get the correct result, but if both sides are long then the performance penalty is significant.
  • Threads within a warp (actually a half-warp, but if you get it right for warps then you're safe on the next generation too) fetch data from memory together, so if you can ensure that all threads fetch data within the same 'segment' then you will only pay one memory transaction, whereas if they all fetch from random addresses then you will pay 32 memory transactions. See the Advanced CUDA C presentation for details on this, but only when you are ready!
  • Threads within a warp (again a half-warp on current GPUs) access shared memory together, and if you're not careful you will have 'bank conflicts', where the threads have to queue up behind each other to access the memory.

So, having understood what a warp is, the final point is how the blocks and grid are mapped onto the GPU.

Each block will start on one SM and will remain there until it has completed. As soon as it has completed, it will retire and another block can be launched on the SM. It's this dynamic scheduling that gives GPUs their scalability: if you have one SM then all blocks run on the same SM in one big queue; if you have 30 SMs then the blocks will be scheduled across the SMs dynamically. So you should ensure that when you launch a GPU function your grid is composed of a large number of blocks (at least hundreds) to ensure it scales across any GPU.

The final point to make is that an SM can execute more than one block at any given time. This explains why an SM can handle 768 threads (or more on some GPUs) while a block is limited to 512 threads (currently). Essentially, if the SM has the resources available (registers and shared memory) then it will take on additional blocks (up to 8). The Occupancy Calculator spreadsheet (included with the SDK) will help you determine how many blocks can execute at any moment.

Sorry for the brain dump; watch the webinars, it'll be easier!
