CUDA shared memory and block execution scheduling


Question

I would like to clear up how CUDA schedules blocks for execution based on the amount of shared memory each block uses.

I am targeting a GTX480 NVIDIA card, which has 48KB of shared memory per SM and 15 streaming multiprocessors. So, if I launch a kernel with 15 blocks, each using 48KB of shared memory, and no other limit is reached (registers, maximum threads per block, etc.), every block runs on one SM (of the 15) until it finishes. In this case only warps of the same block need to be scheduled.
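As a quick sanity check, the limits this reasoning relies on can be read from the CUDA runtime. This is a minimal sketch, not from the original question; on a GTX480 it should report 15 SMs and 49152 bytes per block:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the limits the question relies on. A block that uses the full
    // 48KB of shared memory occupies an SM by itself.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}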

So, my misunderstanding scenario is:
I call a kernel with 30 blocks so that 2 blocks reside on each SM. Now the scheduler on each SM has to deal with warps from different blocks. But warps of the second block will only execute on the SM once the first block finishes, because the entire amount of shared memory (48KB per SM) is already in use. If this did not happen, and warps of different blocks were scheduled for execution on the same SM, the result could be wrong, because one block could read values that the other block loaded into shared memory. Am I right?

Answer

You don't need to worry about this. As you have correctly said, if only one block fits per SM because of the amount of shared memory used, only one block will be scheduled at any one time. So there is no chance of memory corruption caused by overcommitting shared memory.
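To illustrate why corruption cannot occur, here is a hypothetical kernel (the names and sizes are my own, not from the answer) in which each block fills a full 48KB shared buffer with its own block index. Because shared memory is private to a block, the final check never fires, no matter how the blocks are scheduled:

#include <cstdio>
#include <cuda_runtime.h>

constexpr int kInts = 48 * 1024 / sizeof(int);  // a full 48KB buffer, as ints

__global__ void fill_and_check(int *corrupt)
{
    __shared__ int buf[kInts];             // 48KB of shared memory per block
    // Tag every element with this block's index.
    for (int i = threadIdx.x; i < kInts; i += blockDim.x)
        buf[i] = blockIdx.x;
    __syncthreads();
    // Shared memory is private to the block, so no element can have been
    // overwritten by another block, regardless of scheduling.
    for (int i = threadIdx.x; i < kInts; i += blockDim.x)
        if (buf[i] != blockIdx.x)
            corrupt[blockIdx.x] = 1;       // never executed
}

int main()
{
    int *corrupt;
    cudaMalloc(&corrupt, 30 * sizeof(int));
    cudaMemset(corrupt, 0, 30 * sizeof(int));
    fill_and_check<<<30, 256>>>(corrupt);  // 30 blocks, as in the question
    cudaDeviceSynchronize();
    cudaFree(corrupt);
    return 0;
}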

BTW, for performance reasons it is usually better to have at least two blocks running per SM (see the occupancy sketch after this list), because:

  • during __syncthreads() the SM may idle unnecessarily, as fewer and fewer warps from the block may still be runnable.
  • warps of the same block tend to run tightly coupled, so there may be times when all warps wait for memory and other times when all warps perform computations. With more blocks this may even out, resulting in better resource utilization overall.
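The runtime's occupancy API can show how the shared-memory footprint translates into resident blocks per SM. Note this API arrived in CUDA 6.5, later than the GTX480 era, and the kernel and sizes here are illustrative assumptions:

#include <cstdio>
#include <cuda_runtime.h>

extern __shared__ char smem[];             // dynamic shared memory

__global__ void worker() { smem[threadIdx.x] = 0; }  // placeholder kernel

int main()
{
    int at24KB = 0, at48KB = 0;
    // How many blocks of `worker` (256 threads each) fit on one SM for a
    // given dynamic shared-memory footprint per block?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&at24KB, worker, 256, 24 * 1024);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&at48KB, worker, 256, 48 * 1024);
    printf("24KB per block -> %d resident block(s) per SM\n", at24KB);  // typically 2
    printf("48KB per block -> %d resident block(s) per SM\n", at48KB);  // typically 1
    return 0;
}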

Of course, there may be reasons why giving each block more shared memory yields a larger speedup than running multiple blocks per SM would.

