Shared memory optimization confusion


Question

I have written a CUDA application that uses 1 KB of shared memory per block. Since each SM has only 16 KB of shared memory, at most 16 blocks can be accommodated in total (am I understanding this correctly?), although only 8 can be scheduled at a time. Now suppose some blocks are busy with memory operations, so other blocks would be scheduled on the GPU, but all of the shared memory is already in use by the 16 blocks scheduled there. Will CUDA then refuse to schedule more blocks on the same SM until the previously allocated blocks have completely finished? Or will it move some block's shared memory to global memory and allocate another block in its place (in which case, should we worry about global memory access latency)?

Recommended answer

It does not work like that. The number of blocks scheduled to run at any given moment on a single SM will always be the minimum of the following:


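  1. 8 blocks.

  2. The number of blocks whose combined static and dynamically allocated shared memory fits within 16 KB or 48 KB, depending on GPU architecture and settings. There are also shared memory page size limitations, which mean that each per-block allocation is rounded up to the next multiple of the page size.

  3. The number of blocks whose combined per-block register usage fits within 8192/16384/32768 registers, depending on GPU architecture. The register file also has a page size, which means that each per-block allocation is rounded up to the next multiple of the page size.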

That is all there is to it. There is no "paging" of shared memory to accommodate more blocks; a new block becomes resident on an SM only once a running block has finished and released its resources. NVIDIA produces a spreadsheet for computing occupancy, which ships with the toolkit and is also available as a separate download. You can see the exact rules in the formulas it contains. They are also discussed in section 4.2 of the CUDA programming guide.
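As a rough illustration of the "minimum of three limits" rule, here is a minimal Python sketch. It models only the three limits listed in the answer. The function names and the two granularity values (512) are assumptions made for illustration; the real page sizes depend on the architecture and should be taken from NVIDIA's occupancy spreadsheet.

```python
def round_up(value, granularity):
    """Round value up to the next multiple of granularity."""
    return (value + granularity - 1) // granularity * granularity


def resident_blocks_per_sm(
    smem_per_block,         # bytes of static + dynamic shared memory per block
    regs_per_thread,        # registers used by each thread
    threads_per_block,      # threads in one block
    max_blocks_per_sm=8,    # block limit quoted in the answer
    smem_per_sm=16 * 1024,  # 16 KB (48 KB on some GPUs and settings)
    regs_per_sm=8192,       # 8192 / 16384 / 32768 depending on architecture
    smem_granularity=512,   # ASSUMED shared memory page size, illustrative only
    reg_granularity=512,    # ASSUMED register page size, illustrative only
):
    """Estimate how many blocks can be resident on one SM at once."""
    # Per-block allocations are rounded up to a multiple of the page size.
    smem_alloc = round_up(smem_per_block, smem_granularity)
    reg_alloc = round_up(regs_per_thread * threads_per_block, reg_granularity)

    # A resource the kernel does not use imposes no limit of its own.
    by_smem = smem_per_sm // smem_alloc if smem_alloc else max_blocks_per_sm
    by_regs = regs_per_sm // reg_alloc if reg_alloc else max_blocks_per_sm

    # The SM holds as many blocks as the tightest limit allows.
    return min(max_blocks_per_sm, by_smem, by_regs)


# The case from the question: 1 KB of shared memory per block.
# Shared memory alone would allow 16 blocks, but the block limit caps it at 8.
print(resident_blocks_per_sm(1024, regs_per_thread=10, threads_per_block=64))   # 8

# With heavier register use (16 * 128 = 2048 registers per block),
# the register file becomes the binding limit: 8192 // 2048 = 4.
print(resident_blocks_per_sm(1024, regs_per_thread=16, threads_per_block=128))  # 4
```

Under these assumed numbers, the 1 KB per block from the question is never the bottleneck on a 16 KB SM: either the 8-block cap or the register file limits residency first.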
