Access Path in Zero-Copy in OpenCL


Question

I am a little confused about how exactly zero-copy works.

1. I want to confirm that the following corresponds to zero-copy in OpenCL.

 .......................
 .           .         .  
 .           .         .
 .           . CPU     . 
 .   SYSTEM  .         .
 .    RAM    . c3 X    .  
 .         <=====>     .  
 ...|...................
   PCI-E     / /
    |       / /
 c2 |X     /PCI-E, CPU directly accessing GPU memory
    |     / /                          copy c3, c2 is avoided, indicated by X. 
 ...|...././................
 .   MEMORY<====>          .
 .   OBJECT  .c1           . 
 .           .     GPU     .
 .   GPU RAM .             .  
 .           .             .  
 ...........................




 .......................
 .           .         .  
 .           .         .
 .           .   CPU   . 
 .SYSTEM RAM .         .
 .           .         .
 .           . c3      .  
 .    MEMORY<====>     .           
 ...| OBJECT............
    |     \  \   
   PCI-E   \  \PCI-E, GPU directly accessing System memory.  copy c2, c1 is avoided
    |       \  \
 C2 |X       \  \
 ...|.........\..\...........
 .  |        .              .
 .       <=======>          . 
 .   GPU    c1 X   GPU      .
 .   RAM     .              .  
 .           .              .  
 ............................

The GPU/CPU accesses system/GPU RAM directly, without an explicit copy.

2. What is the advantage of this? PCIe still limits the overall bandwidth. Or is the only advantage that we avoid copies c2 and c1/c3 in the situations above?

Recommended answer

Your understanding of how zero-copy works is correct. The basic premise is that the device can access host memory, or the host can access device memory, without an intermediate buffering step in between.

You can perform zero-copy by creating buffers with one of the following flags:

CL_MEM_USE_PERSISTENT_MEM_AMD // Device-resident memory (AMD vendor extension)
CL_MEM_ALLOC_HOST_PTR         // Host-resident memory

The buffers can then be accessed using memory-mapping semantics:

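cl_int err;
// Blocking map: returns a host pointer to the buffer's contents
void* p = clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_WRITE, 0, size, 0, NULL, NULL, &err);
// Perform writes to the buffer through p
// Unmap before the device uses the buffer again
err = clEnqueueUnmapMemObject(queue, buffer, p, 0, NULL, NULL);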

With zero-copy you may be able to gain performance over an implementation that does the following:

  1. Copy the file to a host buffer
  2. Copy the buffer to the device

Instead, you could do it all with a single copy:

  1. Memory-map the device-side buffer
  2. Copy the file from the host to the device
  3. Unmap the memory

On some implementations, the map and unmap calls can hide the cost of the data transfer. In our example:

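  1. Memory-map the device-side buffer [actually creates a host-side buffer of the same size]
  2. Copy the file from the host to the device [actually writes to the host-side buffer]
  3. Unmap the memory [actually copies the data from the host buffer to the device buffer via clEnqueueWriteBuffer]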

If the implementation behaves this way, the mapping approach brings no benefit. However, AMD's newer OpenCL drivers allow the data to be written directly, making the cost of mapping and unmapping almost zero. For discrete graphics cards, the requests still go over the PCIe bus, so data transfers can be slow.

On an APU, however, zero-copy semantics can greatly speed up transfers because of the APU's unique architecture. In this architecture, the PCIe bus is replaced by the Unified North Bridge (UNB), which allows faster transfers.

Be aware that when you use zero-copy semantics with memory mapping, you will see absolutely horrendous bandwidth when reading a device-side buffer from the host. These bandwidths are on the order of 0.01 Gb/s and can easily become a new bottleneck for your code.

Sorry if this is too much information. This was my thesis topic.
