Data sharing between CPU and GPU on modern x86 hardware with OpenCL or other GPGPU framework

Question

Progressing unification of CPU and GPU hardware, as evidenced by AMD Kaveri with hUMA (heterogeneous Uniform Memory Access) and Intel 4th generation CPUs, should allow copy-free sharing of data between CPU and GPU. I would like to know whether the most recent OpenCL (or other GPGPU framework) implementations allow true copy-free sharing (no explicit or implicit data copying) of large data structures between code running on the CPU and the GPU.

Answer

The ability to share data between host and device without any memory transfers has been available in OpenCL since version 1.0, via the CL_MEM_ALLOC_HOST_PTR flag. This flag allocates a buffer for the device, but ensures that it lies in memory that is also accessible by the host. The workflow for these 'zero-copy' transfers usually takes the following form:

// Allocate a device buffer using host-accessible memory
cl_mem d_buffer = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

// Get a host-pointer for the buffer
void *h_buffer = clEnqueueMapBuffer(queue, d_buffer, CL_TRUE, CL_MAP_WRITE,
                                    0, size, 0, NULL, NULL, &err);

// Write data into h_buffer from the host
... 

// Unmap the memory buffer
clEnqueueUnmapMemObject(queue, d_buffer, h_buffer, 0, NULL, NULL);

// Do stuff with the buffer on the device
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_buffer);
clEnqueueNDRangeKernel(queue, kernel, ...);

This creates a device buffer, writes some data into it from the host, and then runs a kernel that uses the buffer on the device. Because of the way the buffer was allocated, this should not result in a memory transfer if the device and host have a unified memory system.
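For completeness (this sketch is not part of the original answer), reading results back on the host follows the same map/unmap pattern; with an in-order queue, the blocking map also waits for the kernel to finish:

// Map the buffer again, this time for reading on the host
h_buffer = clEnqueueMapBuffer(queue, d_buffer, CL_TRUE, CL_MAP_READ,
                              0, size, 0, NULL, NULL, &err);

// Read results from h_buffer on the host
...

// Unmap the memory buffer when done
clEnqueueUnmapMemObject(queue, d_buffer, h_buffer, 0, NULL, NULL);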

The above approach is limited to simple, flat data structures (1D arrays). If you are interested in working with something a little more complex, such as linked lists, trees, or any other pointer-based data structures, you'll need to take advantage of the Shared Virtual Memory (SVM) feature in OpenCL 2.0. At the time of writing, AMD and Intel have both released preview support for OpenCL 2.0 functionality, but I cannot vouch for their implementations of SVM.

The workflow for the SVM approach is somewhat similar to the code listed above. In short, you allocate a buffer using clSVMAlloc, which returns a pointer that is valid on both the host and the device. You use clEnqueueSVMMap and clEnqueueSVMUnmap to synchronise the data when you wish to access the buffer from the host, and clSetKernelArgSVMPointer to pass it to the device. The crucial difference between SVM and CL_MEM_ALLOC_HOST_PTR is that an SVM pointer can also be embedded inside another buffer passed to the device (e.g. inside a struct or pointed to by another pointer). This is what allows you to build complex pointer-based data structures that can be shared between the host and device.
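As a rough sketch only (coarse-grained SVM, error handling omitted, and not taken from the original answer), the host-side code for that workflow might look like this:

// Allocate a coarse-grained SVM buffer; the returned pointer is valid
// on both the host and the device
float *data = (float *)clSVMAlloc(context, CL_MEM_READ_WRITE, size, 0);

// Map the allocation before touching it on the host
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, size, 0, NULL, NULL);

// Write data into the buffer from the host
...

// Unmap so the device can safely access it
clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

// Pass the SVM pointer directly to the kernel
clSetKernelArgSVMPointer(kernel, 0, data);
clEnqueueNDRangeKernel(queue, kernel, ...);

// Free the allocation when finished
clSVMFree(context, data);

If the kernel reaches other SVM allocations only indirectly (for example through pointers stored inside this buffer), those allocations also need to be declared to the runtime via clSetKernelExecInfo with CL_KERNEL_EXEC_INFO_SVM_PTRS.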
