In CUDA, why do cudaMemcpy2D and cudaMallocPitch consume so much time?


Question

As mentioned in the title, I found that cudaMallocPitch() consumes a lot of time, and cudaMemcpy2D() takes quite some time as well.

Here is the code I am using:

// Allocate a pitched 2D device buffer; note that cudaMallocPitch returns
// DeviceStride in bytes
cudaMallocPitch((void **)(&SrcDst), &DeviceStride, Size.width * sizeof(float), Size.height);

// Copy the host image into the pitched device buffer, row by row.
// NOTE: the pitch argument below is DeviceStride * sizeof(float), which assumes
// DeviceStride was converted to element units (e.g. DeviceStride /= sizeof(float))
// in code not shown here; the raw pitch from cudaMallocPitch is already in bytes.
cudaMemcpy2D(SrcDst, DeviceStride * sizeof(float),
        ImgF1, StrideF * sizeof(float),
        Size.width * sizeof(float), Size.height,
        cudaMemcpyHostToDevice);

In my implementation, Size.width and Size.height are both 4800. cudaMallocPitch() takes about 150-160 ms (measured over multiple runs to rule out one-off outliers), and cudaMemcpy2D() takes about 50 ms.
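For reference, a minimal sketch of how such a timing can be taken with CUDA events (the variable names are illustrative, not necessarily how the measurement above was done):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int width = 4800, height = 4800;
    float *SrcDst = nullptr;
    size_t DeviceStride = 0;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the pitched allocation
    cudaEventRecord(start);
    cudaMallocPitch((void **)&SrcDst, &DeviceStride, width * sizeof(float), height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMallocPitch: %.3f ms\n", ms);

    cudaFree(SrcDst);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}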

It seems unlikely that the memory bandwidth between the CPU and the GPU is this limited, yet I cannot see any error in the code. So what is the reason?

By the way, the hardware I am using is an Intel i7-4770K CPU and an NVIDIA GeForce GTX 780 (quite good hardware, and running without errors).

Answer

There are many factors here which may be impacting performance.

Regarding cudaMallocPitch, if it happens to be the first CUDA call in your program, it will incur additional overhead, because it absorbs the one-time cost of initializing the device and creating the CUDA context.
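A common way to exclude that startup cost from measurements is to issue a throwaway runtime call before anything you time; a minimal sketch (cudaFree(0) is a conventional way to force context creation):

#include <cuda_runtime.h>

int main() {
    // Force lazy CUDA context creation up front, so the first timed
    // call does not absorb the startup cost.
    cudaFree(0);

    // ... timed cudaMallocPitch / cudaMemcpy2D calls go here ...
    return 0;
}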

Regarding cudaMemcpy2D, this is accomplished under the hood via a sequence of individual memcpy operations, one per row of your 2D area (i.e. 4800 individual DMA operations). This necessarily incurs additional overhead compared to an ordinary cudaMemcpy operation, which transfers the entire data area in a single DMA transfer. Furthermore, peak transfer speeds are only achieved when the host-side memory buffer is pinned. Finally, you don't say anything about your platform. If you are on Windows, the WDDM driver model will interfere with full transfer performance for this operation, and we don't know what kind of PCIe link you are on.
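A sketch of what pinning the host buffer could look like, assuming you control the allocation of the host image (cudaMallocHost allocates page-locked memory directly; cudaHostRegister can pin an existing allocation instead):

#include <cuda_runtime.h>

int main() {
    const int width = 4800, height = 4800;

    // Allocate the host image directly in pinned (page-locked) memory,
    // which lets the GPU's DMA engine run at full PCIe speed.
    float *ImgF1 = nullptr;
    cudaMallocHost((void **)&ImgF1, (size_t)width * height * sizeof(float));

    // ... fill ImgF1 and perform the transfer ...

    cudaFreeHost(ImgF1);
    return 0;
}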

4800 * 4800 * 4 bytes / 0.050 s ≈ 1.84 GB/s, which is a significant fraction of the ~3 GB/s that is roughly achievable for a non-pinned transfer across PCIe 2.0. The drop from ~3 GB/s to 1.84 GB/s is easily explained by the other factors listed above.

If you want full transfer performance, use pinned host memory and avoid the pitched/2D transfer.
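Putting both suggestions together, a minimal sketch of the fast path (assuming the image can be stored contiguously, without row padding, on both the host and the device; the names are illustrative):

#include <cuda_runtime.h>

int main() {
    const int width = 4800, height = 4800;
    const size_t bytes = (size_t)width * height * sizeof(float);

    // Pinned host buffer: enables full-speed DMA across PCIe
    float *hostImg = nullptr;
    cudaMallocHost((void **)&hostImg, bytes);

    // Contiguous (unpitched) device buffer: the whole image moves in
    // one DMA transfer instead of one transfer per row
    float *devImg = nullptr;
    cudaMalloc((void **)&devImg, bytes);

    // ... fill hostImg ...

    cudaMemcpy(devImg, hostImg, bytes, cudaMemcpyHostToDevice);

    cudaFree(devImg);
    cudaFreeHost(hostImg);
    return 0;
}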

