Techniques to Reduce CPU to GPU Data Transfer Latency


Problem Description

I've been looking into ways to reduce the latency caused by transferring data back and forth between the CPU and GPU. When I first started using CUDA I noticed that data transfer between the CPU and GPU did take a few seconds, but I didn't really care because this isn't really a concern for the small programs I've been writing. In fact, the latency probably isn't much of a problem for the vast majority of the programs that utilize GPUs, video games included, because they're still a lot faster than if they had run on the CPU.

However, I'm a bit of an HPC enthusiast and I became concerned with the direction of my studies when I saw the massive discrepancy between the Tianhe-I theoretical peak FLOPS and the actual LINPACK measured performance. This has raised my concerns about whether I'm taking the right career path.

Use of pinned (page-locked) memory through the cudaHostAlloc() function is one method of reducing latency (quite effective), but are there any other techniques I'm not aware of? And to be clear, I'm talking about optimizing the code, not the hardware itself (that's NVIDIA's and AMD's job).
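
For illustration, here is a minimal sketch of the pinned-memory approach I'm referring to (the buffer size and names are just placeholders for the example):

    // Allocate page-locked host memory with cudaHostAlloc() instead of malloc()
    // so the driver can DMA directly from the buffer, which usually gives
    // noticeably higher host<->device copy throughput than pageable memory.
    #include <cuda_runtime.h>

    int main() {
        const size_t n = 1 << 24;                 // example size: ~16M floats
        const size_t bytes = n * sizeof(float);

        float *h_pinned = NULL;
        float *d_data = NULL;
        cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);  // pinned host buffer
        cudaMalloc((void **)&d_data, bytes);

        for (size_t i = 0; i < n; ++i) h_pinned[i] = 1.0f;

        cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);     // DMA copy from pinned memory
        // ... launch kernels that work on d_data ...
        cudaMemcpy(h_pinned, d_data, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        cudaFreeHost(h_pinned);                   // pinned memory is freed with cudaFreeHost()
        return 0;
    }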

Just as a side question, I'm aware that Dell and HP sell Tesla servers. I'm curious as to how well a GPU can be leveraged for a database application, where you would need constant reads from the hard drive (HDD or SSD), an operation only the CPU can perform.

Solution

There are a few ways to address CPU-GPU communication overhead - I hope that's what you mean by latency and not the latency of the transfer itself. Note that I deliberately used the term address instead of reduce as you do not necessarily need to reduce the latency if you can hide it. Also note that I am much more familiar with CUDA, so below I only refer to CUDA, but some features are also available in OpenCL.

As you mentioned, page-locked memory has the very purpose of increasing the achievable CPU-GPU transfer bandwidth. Additionally, one can map page-locked host memory to the GPU, a mechanism which enables GPU kernels to directly access data allocated on the host without an additional data transfer. This mechanism is called zero-copy transfer, and it is useful if the data is read/written only once, accompanied by a substantial amount of computation, and for GPUs with no separate memory (mobile devices). However, if the kernel accessing the zero-copied data is not strongly compute-bound, so that the latency of the data accesses cannot be hidden, page-locked but not mapped memory will be more efficient. Additionally, zero-copy will still work if the data does not fit into GPU memory.
Note that an excessive amount of page-locked memory can cause serious slowdown on the CPU side.
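
As a rough illustration of zero-copy, here is a minimal sketch (the kernel, sizes, and names are assumptions made for the example, not part of the original answer): the host buffers are allocated page-locked and mapped, and the kernel accesses them directly through their device aliases, so no explicit cudaMemcpy is issued.

    #include <cuda_runtime.h>

    // Each element is read and written exactly once, the case where zero-copy pays off.
    __global__ void scale(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Enable mapping of page-locked host memory; must precede context creation.
        cudaSetDeviceFlags(cudaDeviceMapHost);

        float *h_in, *h_out;                               // page-locked, mapped host buffers
        cudaHostAlloc((void **)&h_in, bytes, cudaHostAllocMapped);
        cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocMapped);
        for (int i = 0; i < n; ++i) h_in[i] = (float)i;

        float *d_in, *d_out;                               // device-side aliases of the same memory
        cudaHostGetDevicePointer((void **)&d_in, h_in, 0);
        cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

        scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);   // kernel touches host memory directly
        cudaDeviceSynchronize();                           // results are now visible in h_out

        cudaFreeHost(h_in);
        cudaFreeHost(h_out);
        return 0;
    }

Every access the kernel makes goes over the PCIe bus, which is why this only pays off when each element is touched roughly once or the kernel has enough arithmetic to hide the access latency.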

Approaching the problem from a different angle, as tkerwin mentioned, asynchronous transfer (with respect to the CPU thread talking to the GPU) is the key to hiding CPU-GPU transfer latency by overlapping computation on the CPU with the transfer. This can be achieved with cudaMemcpyAsync() as well as by using zero-copy with asynchronous kernel execution.
One can take this even further by using multiple streams to overlap transfers with kernel execution, as sketched below. Note that stream scheduling might need special attention for good overlapping; Tesla and Quadro cards have dual DMA engines which enable simultaneous data transfer to and from the GPU. Additionally, with CUDA 4.0 it became easier to use a GPU from multiple CPU threads, so in multi-threaded CPU code each thread can send its own data to the GPU and launch kernels more easily.
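
Below is a minimal sketch of the multi-stream pattern (the kernel, chunk size, and stream count are assumptions for the example): the data is processed in chunks, and each chunk's host-to-device copy, kernel launch, and device-to-host copy are issued asynchronously on its own stream, so the copies of one chunk can overlap with the computation of another. Keep in mind that cudaMemcpyAsync() only overlaps with computation when the host buffer is page-locked.

    #include <cuda_runtime.h>

    __global__ void process(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * data[i] + 1.0f;
    }

    int main() {
        const int nStreams = 4;
        const int chunk = 1 << 20;
        const size_t chunkBytes = chunk * sizeof(float);

        float *h_data, *d_data;
        cudaHostAlloc((void **)&h_data, nStreams * chunkBytes, cudaHostAllocDefault);  // pinned, required for overlap
        cudaMalloc((void **)&d_data, nStreams * chunkBytes);
        for (int i = 0; i < nStreams * chunk; ++i) h_data[i] = 1.0f;

        cudaStream_t streams[nStreams];
        for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

        for (int s = 0; s < nStreams; ++s) {
            int offset = s * chunk;
            // Copy in, compute, copy out: the stages of stream s can overlap with
            // different stages of the other streams (and with the CPU thread).
            cudaMemcpyAsync(d_data + offset, h_data + offset, chunkBytes,
                            cudaMemcpyHostToDevice, streams[s]);
            process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk);
            cudaMemcpyAsync(h_data + offset, d_data + offset, chunkBytes,
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();                           // wait for all streams to finish

        for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
        cudaFree(d_data);
        cudaFreeHost(h_data);
        return 0;
    }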

Finally, GMAC implements an asymmetric shared memory model for CUDA. One of its very interesting features is the coherency models it provides, in particular lazy and rolling updates, which enable transferring only the data modified on the CPU, and in a blocked fashion.
For more details see the following paper: Gelado et al. - An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems.
