CUDA Zero Copy memory considerations


Question

I am trying to figure out if using cudaHostAlloc (or cudaMallocHost?) is appropriate.

I am trying to run a kernel where my input data is larger than the amount of memory available on the GPU.

Can I cudaMallocHost more space than there is on the GPU? If not, and let's say I allocate 1/4 of the space that I need (which will fit on the GPU), is there any advantage to using pinned memory?

I would essentially still have to copy from that 1/4-sized buffer into my full-size malloc'd buffer, and that's probably no faster than just using normal cudaMalloc, right?

Is this typical usage scenario for cudaMallocHost correct:


  1. allocate pinned host memory (let's call it "h_p")
  2. populate h_p with input data
  3. get a device pointer on the GPU for h_p
  4. run the kernel using that device pointer to modify the contents of the array
  5. use h_p as normal, which now has the modified contents

So no copy has to happen between steps 4 and 5, right?

If this is correct,

Answer

Memory transfer is an important factor when it comes to the performance of CUDA applications. cudaMallocHost can do two things:

  • allocate pinned memory: this is page-locked host memory that the CUDA runtime can track. If host memory allocated this way is involved in a cudaMemcpy as either source or destination, the CUDA runtime can perform an optimized memory transfer.
  • allocate mapped memory: this is also page-locked memory that can be used directly in kernel code because it is mapped into the CUDA address space. To do this you have to set the cudaDeviceMapHost flag using cudaSetDeviceFlags before calling any other CUDA function. The GPU memory size does not limit the size of mapped host memory.
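
For concreteness, a minimal sketch of both allocation flavors; the buffer size is an arbitrary placeholder and error checking is omitted:

```cpp
#include <cuda_runtime.h>

int main() {
    // Must be called before any other CUDA call creates a context,
    // otherwise mapped allocations are not available.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t bytes = 64 * 1024 * 1024;  // placeholder size
    float *pinned = nullptr, *mapped = nullptr, *d_alias = nullptr;

    // Flavor 1: pinned only -- an optimized source/destination for cudaMemcpy.
    cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);

    // Flavor 2: pinned and mapped -- kernels can dereference it directly
    // through a device-side alias; size is not limited by GPU memory.
    cudaHostAlloc((void **)&mapped, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_alias, mapped, 0);

    cudaFreeHost(pinned);
    cudaFreeHost(mapped);
    return 0;
}
```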

I'm not sure about the performance of the latter technique. It could allow you to overlap computation and communication very nicely.

If you access the memory in blocks inside your kernel (i.e. you don't need the entire data, only a section), you could use a multi-buffering method utilizing asynchronous memory transfers with cudaMemcpyAsync by having multiple buffers on the GPU: compute on one buffer, transfer one buffer to the host, and transfer one buffer to the device at the same time.
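
A rough sketch of such a double-buffered pipeline, assuming a placeholder kernel named process and pinned host buffers; the chunk size and launch configuration are illustrative only:

```cpp
#include <cuda_runtime.h>

__global__ void process(float *buf, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

// Stream `total` elements through two device buffers of `chunk` elements,
// overlapping host<->device transfers with kernel execution.
void pipeline(float *h_in, float *h_out, int total, int chunk) {
    float *d_buf[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc((void **)&d_buf[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (int off = 0, b = 0; off < total; off += chunk, b ^= 1) {
        int n = (total - off < chunk) ? total - off : chunk;
        // h_in/h_out must be pinned (cudaMallocHost) for the copies to be
        // truly asynchronous with respect to the host.
        cudaMemcpyAsync(d_buf[b], h_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(n + 255) / 256, 256, 0, s[b]>>>(d_buf[b], n);
        cudaMemcpyAsync(h_out + off, d_buf[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(s[b]);
        cudaStreamDestroy(s[b]);
        cudaFree(d_buf[b]);
    }
}
```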

I believe your assertions about the usage scenario are correct when using the cudaDeviceMapHost type of allocation. You do not have to do an explicit copy, but there will certainly be an implicit copy that you don't see. There's a chance it overlaps nicely with your computation. Note that you might need to synchronize after the kernel call to make sure the kernel has finished and that you have the modified contents in h_p.
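
To illustrate, a minimal end-to-end sketch of the five steps from the question, with the synchronization point called out; the kernel name and sizes are placeholders:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleAll(float *a, int n) {  // placeholder kernel for step 4
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);           // enable mapped memory

    const int n = 1024;
    float *h_p = nullptr, *d_p = nullptr;
    cudaHostAlloc((void **)&h_p, n * sizeof(float),
                  cudaHostAllocMapped);              // 1. allocate pinned+mapped h_p
    for (int i = 0; i < n; ++i) h_p[i] = (float)i;   // 2. populate with input data
    cudaHostGetDevicePointer((void **)&d_p, h_p, 0); // 3. device pointer for h_p

    doubleAll<<<(n + 255) / 256, 256>>>(d_p, n);     // 4. kernel modifies the array
    cudaDeviceSynchronize();                         // ensure the kernel finished

    printf("h_p[1] = %f\n", h_p[1]);                 // 5. use h_p: prints 2.000000
    cudaFreeHost(h_p);
    return 0;
}
```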

