Why is CUDA pinned memory so fast?


Question

I observe substantial speedups when I use pinned memory for CUDA data transfers. On Linux, the underlying system call for pinning is mlock. According to the mlock man page, locking a page prevents it from being swapped out:

mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully;

In my tests, I had a few gigabytes of free memory on my system, so there was never any risk that the memory pages could have been swapped out, yet I still observed the speedup. Can anyone explain what is really going on here? Any insight or information is much appreciated.

Recommended answer

The CUDA driver checks whether the memory range is locked and picks a different code path accordingly. Locked memory is guaranteed to stay in physical memory (RAM), so the device can fetch it without help from the CPU (DMA, which is also what makes asynchronous copies possible; the device only needs the list of physical pages). Non-locked memory can generate a page fault on access and is not guaranteed to be resident in RAM (for example, it may be in swap), so the driver has to touch every page of the non-locked memory, copy it into a pinned buffer, and hand that buffer to the DMA engine (a synchronous, page-by-page copy). That extra staging copy is paid even when nothing is actually swapped out, which is why the speedup shows up on a system with plenty of free memory.
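To see the two code paths side by side, here is a minimal sketch that times the same host-to-device copy from a pageable malloc buffer and from a pinned cudaMallocHost buffer. The 256 MiB size, the CHECK macro and the copy_bandwidth_gbs helper are my own choices for illustration, not anything from the question, and the numbers you get will depend on your GPU, PCIe link and driver.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Time one host-to-device cudaMemcpy of `bytes` bytes; returns GB/s.
static float copy_bandwidth_gbs(void* dst, const void* src, size_t bytes) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return (bytes / 1e9f) / (ms / 1e3f);
}

int main() {
    const size_t bytes = 256u << 20;  // 256 MiB, an arbitrary test size

    void* pageable = std::malloc(bytes);  // ordinary, swappable memory
    if (pageable == nullptr) return EXIT_FAILURE;
    void* pinned = nullptr;               // page-locked memory
    void* device = nullptr;
    CHECK(cudaMallocHost(&pinned, bytes));
    CHECK(cudaMalloc(&device, bytes));

    // Touch both buffers so every page is resident before timing.
    std::memset(pageable, 1, bytes);
    std::memset(pinned, 1, bytes);

    // Warm-up copy so one-time initialization is not measured.
    CHECK(cudaMemcpy(device, pinned, bytes, cudaMemcpyHostToDevice));

    std::printf("pageable: %.2f GB/s\n",
                copy_bandwidth_gbs(device, pageable, bytes));
    std::printf("pinned:   %.2f GB/s\n",
                copy_bandwidth_gbs(device, pinned, bytes));

    CHECK(cudaFree(device));
    CHECK(cudaFreeHost(pinned));
    std::free(pageable);
    return 0;
}
```

Both buffers are fully resident here, so any gap between the two lines comes from the driver's staging copy rather than from swapping.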

As described here: http://forums.nvidia.com/index.php?showtopic=164661

Host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.
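As a rough illustration of that requirement, the sketch below allocates the host buffer with cudaMallocHost and queues a copy in, a kernel and a copy out on one stream with cudaMemcpyAsync. The scale kernel, the buffer size and the CHECK macro are hypothetical names I made up for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Illustrative kernel: multiply every element by `factor`.
__global__ void scale(float* data, size_t n, float factor) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = 1u << 20;
    const size_t bytes = n * sizeof(float);

    float* host = nullptr;
    float* dev = nullptr;
    // Page-locked host buffer, as the quoted requirement asks for.
    CHECK(cudaMallocHost(reinterpret_cast<void**>(&host), bytes));
    CHECK(cudaMalloc(reinterpret_cast<void**>(&dev), bytes));
    for (size_t i = 0; i < n; ++i) host[i] = 1.0f;

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // All three operations are queued on the stream; the calls return
    // to the host without waiting for the work to finish.
    CHECK(cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream));
    const unsigned int threads = 256;
    const unsigned int blocks =
        static_cast<unsigned int>((n + threads - 1) / threads);
    scale<<<blocks, threads, 0, stream>>>(dev, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream));

    // The CPU could do unrelated work here while the DMA engine runs.

    // Wait for the queued work before reading the result on the host.
    CHECK(cudaStreamSynchronize(stream));
    std::printf("host[0] = %.1f\n", host[0]);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    return 0;
}
```

If the host buffer came from plain malloc instead, the calls would still work, but as far as I know the copies would go through the staged path described above and would not overlap with host work in the same way.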

I also recommend checking the cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. The cudaHostAlloc documentation says that the CUDA driver can detect pinned memory:

The driver tracks the virtual memory ranges allocated with this (cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().
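One way to watch that tracking from user code is to ask the runtime what it knows about a pointer. The sketch below is only an illustration under a few assumptions: is_pinned_host is a hypothetical helper of mine, the type field of cudaPointerAttributes needs CUDA 10 or newer, and older runtimes return an error for plain malloc pointers instead of reporting them as unregistered, so the helper treats an error as "not tracked".

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: true if the runtime reports `p` as page-locked
// host memory that it is tracking.
static bool is_pinned_host(const void* p) {
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, p);
    if (err != cudaSuccess) {
        // Older runtimes fail for pointers they do not know about.
        cudaGetLastError();  // reset the error state
        return false;
    }
    return attr.type == cudaMemoryTypeHost;
}

int main() {
    const size_t bytes = 1u << 20;

    void* pageable = std::malloc(bytes);
    if (pageable == nullptr) return EXIT_FAILURE;

    void* pinned = nullptr;
    if (cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault) != cudaSuccess) {
        std::fprintf(stderr, "cudaHostAlloc failed\n");
        std::free(pageable);
        return EXIT_FAILURE;
    }

    std::printf("malloc buffer tracked as pinned:        %s\n",
                is_pinned_host(pageable) ? "yes" : "no");
    std::printf("cudaHostAlloc buffer tracked as pinned: %s\n",
                is_pinned_host(pinned) ? "yes" : "no");

    cudaFreeHost(pinned);
    std::free(pageable);
    return 0;
}
```

Because the lookup is keyed on the virtual address range, a plain cudaMemcpy from a cudaHostAlloc buffer can take the fast path without any extra flag from the caller.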
