CUDA Zero Copy memory considerations
Question
I am trying to figure out whether using cudaHostAlloc (or cudaMallocHost?) is appropriate.
I am trying to run a kernel where my input data is larger than the amount of memory available on the GPU.
Can I cudaMallocHost more space than there is on the GPU? If not, and let's say I allocate 1/4 of the space I need (which will fit on the GPU), is there any advantage to using pinned memory?
I would essentially still have to copy from that 1/4-sized buffer into my full-size malloc'd buffer, and that's probably no faster than just using normal cudaMalloc, right?
Is this a typical usage scenario for cudaMallocHost:
1. allocate pinned host memory (let's call it "h_p")
2. populate h_p with input data
3. get a device pointer on the GPU for h_p
4. run a kernel using that device pointer to modify the contents of the array
5. use h_p like normal, which now has the modified contents
So no copy has to happen between steps 4 and 5, right? If this is correct,
Answer
Memory transfers are an important factor when it comes to the performance of CUDA applications. cudaMallocHost can do two things:
- allocate pinned memory: this is page-locked host memory that the CUDA runtime can track. If host memory allocated this way is involved in a cudaMemcpy as either source or destination, the CUDA runtime will be able to perform an optimized memory transfer.
- allocate mapped memory: this is also page-locked memory that can be used directly in kernel code, because it is mapped into the CUDA address space. To do this you have to set the cudaDeviceMapHost flag using cudaSetDeviceFlags before calling any other CUDA function. The GPU memory size does not limit the size of mapped host memory.
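As a minimal sketch of the mapped-memory path (the API calls are standard CUDA runtime functions; the buffer size here is an arbitrary choice for illustration):

```
#include <cuda_runtime.h>

int main() {
    // Must be set before any other call that creates a CUDA context.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t n = 1 << 20;
    float *h_p = nullptr;

    // Page-locked host memory, mapped into the device address space.
    cudaHostAlloc(&h_p, n * sizeof(float), cudaHostAllocMapped);

    // Device-side alias for the same physical memory.
    float *d_p = nullptr;
    cudaHostGetDevicePointer(&d_p, h_p, 0);

    // d_p can now be passed to a kernel; the GPU reads and writes host RAM
    // over the bus, so the allocation is not limited by GPU memory size.

    cudaFreeHost(h_p);
    return 0;
}
```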
I'm not sure about the performance of the latter technique, but it could allow you to overlap computation and communication very nicely.
If you access the memory in blocks inside your kernel (i.e. you don't need the entire data, only a section), you could use a multi-buffering scheme with asynchronous memory transfers via cudaMemcpyAsync, keeping multiple buffers on the GPU: compute on one buffer, transfer one buffer to the host, and transfer another buffer to the device at the same time.
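A sketch of that double-buffering idea, assuming a hypothetical kernel `process` that works on one chunk in place (for the copies to actually overlap with compute, `h_in` and `h_out` should themselves be pinned, e.g. allocated with cudaHostAlloc):

```
#include <cuda_runtime.h>

// Hypothetical kernel that processes one chunk of data in place.
__global__ void process(float *buf, size_t n);

void run_chunked(const float *h_in, float *h_out, size_t total, size_t chunk) {
    float *d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&d_buf[i], chunk * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    for (size_t off = 0; off < total; off += chunk) {
        int s = (off / chunk) % 2;  // alternate between the two buffers
        size_t n = (total - off < chunk) ? total - off : chunk;

        // Copy-in, compute, copy-out are queued on the same stream, so they
        // run in order per chunk but overlap with the other stream's work.
        cudaMemcpyAsync(d_buf[s], h_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n);
        cudaMemcpyAsync(h_out + off, d_buf[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_buf[i]);
    }
}
```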
I believe your assertions about the usage scenario are correct when using the cudaDeviceMapHost type of allocation. You do not have to do an explicit copy, but there will certainly be an implicit copy that you don't see; there's a chance it overlaps nicely with your computation. Note that you need to synchronize after the kernel call to make sure the kernel has finished and that h_p actually contains the modified contents.
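The synchronization point in steps 4 and 5 might look like this (the kernel `increment` is purely illustrative; `d_p` is assumed to come from cudaHostGetDevicePointer as above):

```
// Illustrative kernel: adds 1 to each element through the mapped pointer.
__global__ void increment(float *p, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

// Step 4: launch the kernel on the device pointer for h_p.
increment<<<(n + 255) / 256, 256>>>(d_p, n);

// Kernel launches are asynchronous: without this, the host could read
// h_p before the GPU has finished writing through the mapped pointer.
cudaDeviceSynchronize();

// Step 5: h_p now holds the modified contents and can be used like
// normal host memory.
```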