Persistent buffers in CUDA


Problem description


I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.

I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like

__device__ char* buffer;

but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:

  1. What is the lifetime of the various methods of allocating global memory?
  2. How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?

[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]

Solution

What is the lifetime of the various methods of allocating global memory?

All global memory allocations have the lifetime of the context in which they are allocated. This means that any global memory your application allocates is "persistent" by your definition, irrespective of whether you use the host side APIs or device side allocation on the GPU runtime heap.

How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?

Either method will work as you require, although host APIs are much simpler to use. There are also some important differences between the two approaches.

Memory allocations made with malloc or new in device code come from the device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API (with cudaLimitMallocHeapSize) before running any kernel that calls malloc in device code, otherwise the allocation may fail. The device heap is also not accessible to the host side memory management APIs, so you additionally need a copy kernel to transfer the buffer contents into host-API-accessible memory before you can transfer them back to the host.
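
If you did want the device side route, a minimal sketch is given below, in the same browser-code spirit as the rest of this answer. The kernel names setup_buffer, copy_buffer and teardown_buffer are illustrative, and the 16MB heap size is an assumption you would tune to your allocation:

#include <vector>
#include <cuda_runtime.h>

__device__ char* buffer;

// Setup kernel: allocate on the device runtime heap and publish the
// address through the global symbol
__global__ void setup_buffer(size_t sz)
{
    buffer = static_cast<char*>(malloc(sz));
}

// Copy kernel: the host APIs cannot see the device heap, so stage the
// contents into host-API-accessible (cudaMalloc'd) memory
__global__ void copy_buffer(char* dst, size_t sz)
{
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < sz)
        dst[idx] = buffer[idx];
}

// Teardown kernel: device heap memory must also be freed in device code
__global__ void teardown_buffer()
{
    free(buffer);
}

int main()
{
    const size_t buffer_sz = 800 * 600;

    // Size the device heap before any kernel calls malloc (16MB assumed)
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t(16) * 1024 * 1024);

    setup_buffer<<<1, 1>>>(buffer_sz);

    // ... kernels using buffer go here ...

    // Stage into cudaMalloc'd memory, then copy back to the host
    char* d_staging;
    cudaMalloc(&d_staging, buffer_sz);
    copy_buffer<<<(buffer_sz + 255) / 256, 256>>>(d_staging, buffer_sz);

    std::vector<char> results(buffer_sz);
    cudaMemcpy(results.data(), d_staging, buffer_sz, cudaMemcpyDeviceToHost);

    teardown_buffer<<<1, 1>>>();
    cudaFree(d_staging);

    return 0;
}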

The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:

#include <vector>
#include <cuda_runtime.h>

__device__ char* buffer;

int main()
{
    char* d_buffer;
    const size_t buffer_sz = 800 * 600 * sizeof(char);

    // Allocate memory; it persists for the lifetime of the context,
    // i.e. until cudaFree or program exit
    cudaMalloc(&d_buffer, buffer_sz);

    // Zero the memory and copy the pointer into the global device symbol
    cudaMemset(d_buffer, 0, buffer_sz);
    cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));

    // Kernels go here using buffer

    // Copy the contents back to the host
    std::vector<char> results(800 * 600);
    cudaMemcpy(results.data(), d_buffer, buffer_sz, cudaMemcpyDeviceToHost);

    // buffer remains usable until freed here
    cudaFree(d_buffer);

    return 0;
}

[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]

So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.
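
For completeness, a kernel that uses the persistent buffer simply references the symbol, with no pointer argument in the launch. A hypothetical sketch (fill_buffer and its write pattern are invented for illustration):

// Hypothetical kernel: reads and writes the global symbol directly,
// no explicit buffer argument required
__global__ void fill_buffer(int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        buffer[y * width + x] = char(x ^ y);
}

// Launched between cudaMemcpyToSymbol and the final cudaMemcpy:
// dim3 block(16, 16);
// dim3 grid((800 + 15) / 16, (600 + 15) / 16);
// fill_buffer<<<grid, block>>>(800, 600);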
