Understanding GPU heap memory and resident warps

Question

Is the number of resident warps also limited by the user-specified heap size?

For example, suppose each thread needs to allocate 1 MB of memory and the heap is set to 32 MB in total. (I'm assuming that cudaLimitMallocHeapSize sets the heap available per kernel launch rather than per thread. Is that correct?) Would it then be true that only one warp is allowed on the device?

Recommended answer

The kernel launch (and the issuing of warps or blocks) is not limited by the heap size. The limit only shows up at run time: once the threads that have reached their per-thread malloc() but not yet the corresponding free() request more memory in total than the heap can supply, a device-side malloc() returns NULL, and a kernel that uses that pointer without checking it will fail.

You may wish to refer to the heap memory allocation section of the CUDA C Programming Guide. A per-thread allocation sample is given in that section, and you can easily modify it to prove this behavior to yourself: simply adjust the heap size and the number of threads (or blocks) launched to see what happens when the heap limit is reached.

And yes, cudaLimitMallocHeapSize applies to the whole device context, so it affects all kernel launches that come after the relevant call to cudaDeviceSetLimit(). It is not a per-thread limit. Also note that there is some allocation overhead: setting a heap size of 128 MB does not mean that all 128 MB will be available to subsequent device malloc() operations. Finally, device-side malloc() is only available on devices of compute capability 2.0 and above.
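To try the experiment described above, here is a minimal sketch using the numbers from the question (32 MB heap, 1 MB per thread). It is not the sample from the programming guide. The kernel name hold_allocations, the helper count_failed_allocations and the CUDA_CHECK macro are hypothetical names made up for this illustration. The barrier keeps every successful allocation alive at the same time, so the heap limit is actually exercised.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Every thread requests bytes_per_thread from the device heap and keeps the
// allocation until the whole block has passed the barrier, so all successful
// allocations are live simultaneously.
__global__ void hold_allocations(size_t bytes_per_thread, int *failed)
{
    char *p = static_cast<char *>(malloc(bytes_per_thread));
    if (p == nullptr) {
        atomicAdd(failed, 1);  // the heap could not satisfy this thread
    } else {
        p[0] = 1;              // touch the memory to show it is usable
    }
    __syncthreads();           // no early return above, so all threads arrive
    free(p);                   // free(NULL) is ignored on the device
}

// Hypothetical helper: launch one block and return how many threads got NULL.
int count_failed_allocations(size_t bytes_per_thread, int num_threads)
{
    assert(num_threads > 0 && num_threads <= 1024);  // single-block launch
    int *d_failed = nullptr;
    int h_failed = 0;
    CUDA_CHECK(cudaMalloc(&d_failed, sizeof(int)));
    CUDA_CHECK(cudaMemset(d_failed, 0, sizeof(int)));
    hold_allocations<<<1, num_threads>>>(bytes_per_thread, d_failed);
    CUDA_CHECK(cudaGetLastError());       // launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // execution errors
    CUDA_CHECK(cudaMemcpy(&h_failed, d_failed, sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_failed));
    return h_failed;
}

int main()
{
    const size_t heap_bytes = 32u << 20;       // 32 MB heap for the context
    const size_t bytes_per_thread = 1u << 20;  // 1 MB requested per thread
    const int num_threads = 64;                // two warps in one block

    // Set the heap size before the first kernel that uses malloc() runs.
    CUDA_CHECK(cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes));

    size_t actual = 0;
    CUDA_CHECK(cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize));
    printf("device heap size: %zu bytes\n", actual);

    int failed = count_failed_allocations(bytes_per_thread, num_threads);
    printf("%d of %d threads got NULL from malloc()\n", failed, num_threads);
    return 0;
}
```

Built with nvcc and run on a device of compute capability 2.0 or later, the launch of both warps should succeed, while at least 32 of the 64 threads should report a NULL pointer, since no more than 32 allocations of 1 MB fit in a 32 MB heap. Allocation overhead may push that number higher. Changing heap_bytes or num_threads shows how the failure count moves.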
