How is CUDA memory managed?


Question



When I run my CUDA program, which allocates only a small amount of global memory (below 20 MB), I get an "out of memory" error. (From other people's posts, I think the problem is related to memory fragmentation.) I am trying to understand this problem, and I realize I have a couple of questions related to CUDA memory management.

  1. Is there a virtual memory concept in CUDA?

  2. If only one kernel is allowed to run on CUDA at a time, then after it terminates, will all of the memory it used or allocated be released? If not, when does that memory get freed?

  3. If more than one kernel is allowed to run on CUDA, how can they make sure that the memory they use does not overlap?

Can anyone help me answer these questions? Thanks

Edit 1: Operating system: x86_64 GNU/Linux. CUDA version: 4.0. Device: GeForce 200; it is one of the GPUs attached to the machine, and I don't think it is a display device.

Edit 2: The following is what I got after doing some research. Feel free to correct me.

  1. CUDA will create one context for each host thread. This context keeps information such as which portion of memory (pre-allocated memory or dynamically allocated memory) has been reserved for this application, so that other applications cannot write to it. When this application terminates (not the kernel), this portion of memory will be released.

  2. CUDA memory is maintained by a linked list. When an application needs to allocate memory, it goes through this linked list to see whether a contiguous memory chunk is available for allocation. If it fails to find such a chunk, an "out of memory" error is reported to the user even though the total available memory is larger than the requested amount. That is the problem related to memory fragmentation.

  3. cuMemGetInfo will tell you how much memory is free, but not necessarily how much you can obtain in a single allocation, due to memory fragmentation (a small probe of this is sketched after this list).

  4. On the Vista platform (WDDM), GPU memory virtualization is possible. That is, multiple applications can allocate almost the whole of GPU memory, and WDDM will manage swapping data back to main memory.
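
To make item 3 concrete, the following sketch compares what the runtime-API counterpart cudaMemGetInfo reports as free with the largest single cudaMalloc that actually succeeds, found by a binary search. This is an illustrative probe under assumptions I am adding here, not an official API; the 1 MB resolution is an arbitrary choice.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Establish the context first so that context overhead is already paid.
    cudaFree(0);

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    // Binary search for the largest single allocation that succeeds,
    // stopping at a 1 MB resolution (an arbitrary choice).
    size_t lo = 0, hi = freeBytes;
    while (hi - lo > ((size_t)1 << 20)) {
        size_t mid = lo + (hi - lo) / 2;
        void *p = NULL;
        if (cudaMalloc(&p, mid) == cudaSuccess) {
            cudaFree(p);
            lo = mid;                 // mid bytes fit in one chunk
        } else {
            cudaGetLastError();       // clear the allocation-failure error
            hi = mid;
        }
    }

    printf("reported free: %zu MB, largest single allocation: ~%zu MB\n",
           freeBytes >> 20, lo >> 20);
    return 0;
}

If the largest single allocation is much smaller than the reported free memory, fragmentation (or a reservation elsewhere) is eating into the space.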

New questions:

  1. If the memory reserved in the context is fully released after the application has terminated, memory fragmentation should not exist. There must be some kind of data left in memory.

  2. Is there any way to restructure the GPU memory?

Solution

The device memory available to your code at runtime is basically calculated as

Free memory =   total memory 
              - display driver reservations 
              - CUDA driver reservations
              - CUDA context static allocations (local memory, constant memory, device code)
              - CUDA context runtime heap (in-kernel allocations, recursive call stack, only on Fermi GPUs)
              - CUDA context user allocations (global memory, textures)
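
For the runtime-heap line in particular, the sizes involved can be inspected and adjusted through the device limits API. This is a minimal sketch, assuming a Fermi-or-later GPU and CUDA 4.0+ where cudaDeviceGetLimit is available:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the runtime-heap and stack reservations from the breakdown above.
    // These limits only exist on Fermi (compute capability 2.x) and later;
    // on older devices the calls return cudaErrorUnsupportedLimit.
    size_t heapBytes = 0, stackBytes = 0;
    cudaDeviceGetLimit(&heapBytes, cudaLimitMallocHeapSize);  // in-kernel malloc heap
    cudaDeviceGetLimit(&stackBytes, cudaLimitStackSize);      // per-thread call stack

    printf("in-kernel heap: %zu bytes, per-thread stack: %zu bytes\n",
           heapBytes, stackBytes);

    // Shrinking the in-kernel heap (here to 1 MB, an arbitrary figure) frees
    // device memory for cudaMalloc at the cost of less room for malloc()
    // inside kernels.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, (size_t)1 << 20);
    return 0;
}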

If you are getting an out-of-memory message, then it is likely that one or more of the first three items is consuming most of the GPU memory before your user code ever tries to get memory on the GPU. If, as you have indicated, you are not running on a display GPU, then the context static allocations are the most likely source of your problem. CUDA works by pre-allocating all the memory a context requires at the time the context is established on the device. There are a lot of things which get allocated to support a context, but the single biggest consumer in a context is local memory. The runtime must reserve the maximum amount of local memory which any kernel in the context will consume, for the maximum number of threads which each multiprocessor can run simultaneously, for each multiprocessor on the device. This can run into hundreds of MB of memory if a local-memory-heavy kernel is loaded on a device with a lot of multiprocessors.
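
As a back-of-the-envelope illustration of that reservation (the 4 KB of local memory per thread is an assumed figure, not something measured on the asker's device), the bound can be estimated from the device properties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Rough estimate of the per-context local memory reservation:
    // (assumed local bytes per thread) x (max resident threads per SM) x (number of SMs).
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    size_t localBytesPerThread = 4 * 1024;   // assumed 4 KB of local memory per thread
    size_t reservation = localBytesPerThread
                       * (size_t)prop.maxThreadsPerMultiProcessor
                       * (size_t)prop.multiProcessorCount;

    printf("%d SMs x %d resident threads x 4 KB local = ~%zu MB reserved\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor,
           reservation >> 20);
    return 0;
}

On a GT200-class part with 30 multiprocessors and up to 1024 resident threads per multiprocessor, for example, 4 KB per thread works out to roughly 120 MB.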

The best way to see what might be going on is to write a host program with no device code which establishes a context and calls cudaMemGetInfo. That will show you how much memory the device has with the minimal context overhead on it. Then run your problematic code, adding the same cudaMemGetInfo call before the first cudaMalloc call; that will give you the amount of memory your context is using. That might let you get a handle on where the memory is going. It is very unlikely that fragmentation is the problem if you are getting a failure on the first cudaMalloc call.
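
A minimal sketch of that diagnostic follows. The cudaFree(0) call is just one common way to force context creation without allocating any user memory, the reportMemory helper is a hypothetical name, and the 20 MB request mirrors the size from the question:

#include <cstdio>
#include <cuda_runtime.h>

static void reportMemory(const char *label)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: cudaMemGetInfo failed: %s\n", label, cudaGetErrorString(err));
        return;
    }
    printf("%s: free = %zu MB, total = %zu MB\n",
           label, freeBytes >> 20, totalBytes >> 20);
}

int main()
{
    // Force context creation without allocating user memory, so the first
    // report shows only driver and context overhead.
    cudaFree(0);
    reportMemory("after context creation");

    // Mirror the question: try to allocate 20 MB of global memory.
    void *devPtr = NULL;
    size_t request = (size_t)20 << 20;
    cudaError_t err = cudaMalloc(&devPtr, request);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc(%zu bytes) failed: %s\n", request, cudaGetErrorString(err));
        return 1;
    }
    reportMemory("after 20 MB cudaMalloc");

    cudaFree(devPtr);
    return 0;
}

Compiling this with nvcc and running it on the device in question gives a baseline; dropping the same kind of cudaMemGetInfo call into the failing program just before its first cudaMalloc shows how much extra the real context consumes.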
