How is CUDA memory managed?

Problem Description

When I run my CUDA program, which allocates only a small amount of global memory (below 20 MB), I get an "out of memory" error. (From other people's posts, I think the problem is related to memory fragmentation.) I have tried to understand this problem, and realized I have a couple of questions related to CUDA memory management.

  1. Is there a virtual memory concept in CUDA?

  2. If only one kernel is allowed to run on CUDA at a time, will all of the memory it used or allocated be released after it terminates? If not, when does that memory get freed?

  3. If more than one kernel is allowed to run on CUDA, how can they make sure the memory they use does not overlap?

Can anyone help me answer these questions? Thanks.

Edit 1: Operating system: x86_64 GNU/Linux. CUDA version: 4.0. Device: GeForce 200; it is one of the GPUs attached to the machine, and I don't think it is a display device.

Edit 2: The following is what I got after doing some research. Feel free to correct me.

  1. CUDA will create one context for each host thread. The context keeps track of information such as which portions of memory (pre-allocated or dynamically allocated) have been reserved for this application, so that other applications cannot write to them. When this application (not a kernel) terminates, this memory is released.

  2. CUDA memory is maintained by a linked list. When an application needs to allocate memory, it walks this list to see whether there is a contiguous memory chunk available for the allocation. If it fails to find such a chunk, an "out of memory" error is reported to the user even though the total free memory is larger than the requested size. This is the problem related to memory fragmentation.

  3. cuMemGetInfo will tell you how much memory is free, but not necessarily how much memory you can obtain in a single largest allocation, due to memory fragmentation (see the probe sketch after this list).

  4. On the Vista platform (WDDM), GPU memory virtualization is possible. That is, multiple applications can allocate almost the whole GPU memory, and WDDM will manage swapping data back to main memory.
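
To check how much fragmentation actually matters on a given device, a small probe program can compare what the runtime-API equivalent cudaMemGetInfo reports as free against the largest single cudaMalloc that actually succeeds. This binary-search probe is an illustrative sketch, not code from the original question:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_b = 0, total_b = 0;

    cudaFree(0);                          /* force context creation       */
    cudaMemGetInfo(&free_b, &total_b);

    /* Binary-search the largest single cudaMalloc that succeeds. */
    size_t lo = 0, hi = free_b;
    while (hi - lo > (1 << 20)) {         /* stop at 1 MB resolution      */
        size_t mid = lo + (hi - lo) / 2;
        void *p = NULL;
        if (cudaMalloc(&p, mid) == cudaSuccess) {
            cudaFree(p);
            lo = mid;                     /* mid bytes fit, try larger    */
        } else {
            cudaGetLastError();           /* clear the allocation error   */
            hi = mid;                     /* mid bytes failed, go smaller */
        }
    }

    printf("reported free: %zu MB, largest single allocation: ~%zu MB\n",
           free_b >> 20, lo >> 20);
    return 0;
}

If the reported free size and the largest successful allocation are far apart, the free memory is fragmented; if they are close, fragmentation is not the problem.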

New questions:

  1. If the memory reserved in the context is fully released after the application terminates, memory fragmentation should not exist; there must be some kind of data left in the memory.

  2. Is there any way to restructure the GPU memory?

Recommended Answer

The device memory available to your code at runtime is basically calculated as

Free memory =   total memory 
              - display driver reservations 
              - CUDA driver reservations
              - CUDA context static allocations (local memory, constant memory, device code)
              - CUDA context runtime heap (in-kernel allocations, recursive call stack, printf buffer; only on Fermi and newer GPUs)
              - CUDA context user allocations (global memory, textures)

If you are getting an out-of-memory message, then it is likely that one or more of the first three items is consuming most of the GPU memory before your user code ever tries to allocate memory on the GPU. If, as you have indicated, you are not running on a display GPU, then the context static allocations are the most likely source of your problem. CUDA works by pre-allocating all the memory a context requires at the time the context is established on the device. There are a lot of things that get allocated to support a context, but the single biggest consumer in a context is local memory. The runtime must reserve, for each multiprocessor on the device, the maximum amount of local memory that any kernel in the context will consume, times the maximum number of threads each multiprocessor can run simultaneously. This can run into hundreds of MB of memory if a local-memory-heavy kernel is loaded on a device with many multiprocessors.
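
As a rough back-of-the-envelope check of this effect, you can multiply a kernel's per-thread local memory usage (the lmem figure nvcc reports when compiled with -Xptxas -v) by the maximum number of threads that can be resident on the device. The 4 KB per-thread figure below is a hypothetical example, not a number from the question:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    /* Hypothetical figure: suppose the heaviest kernel in the context
       uses 4 KB of local memory per thread.                           */
    size_t local_per_thread = 4 * 1024;

    /* Worst case the runtime must reserve: local memory for every
       thread that could be resident on the device at once.            */
    size_t reserved = local_per_thread
                    * (size_t)prop.maxThreadsPerMultiProcessor
                    * (size_t)prop.multiProcessorCount;

    printf("%d SMs x %d threads/SM x 4 KB = ~%zu MB of local memory\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor,
           reserved >> 20);
    return 0;
}

On a GT200-class device with 30 multiprocessors and up to 1024 resident threads per multiprocessor, this works out to 4 KB x 1024 x 30, roughly 120 MB, for that hypothetical kernel.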

The best way to see what might be going on is to write a host program with no device code which establishes a context and calls cudaMemGetInfo. That will show you how much memory the device has with the minimal context overhead on it. Then run your problematic code, adding the same cudaMemGetInfo call before the first cudaMalloc call; that will give you the amount of memory your context is using. It might let you get a handle on where the memory is going. It is very unlikely that fragmentation is the problem if you are getting a failure on the first cudaMalloc call.
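
A minimal version of that diagnostic program might look like the following; cudaFree(0) is a common idiom for forcing context creation without launching any device code:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_b = 0, total_b = 0;

    cudaFree(0);   /* establish the context without any device code */

    cudaError_t err = cudaMemGetInfo(&free_b, &total_b);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("free: %zu MB of %zu MB total\n", free_b >> 20, total_b >> 20);
    return 0;
}

Adding the same cudaMemGetInfo call (and print) just before the first cudaMalloc in the problematic program then shows how much memory the full context consumes.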
