What's the correct and most efficient way to use the mapped (zero-copy) memory mechanism in an Nvidia OpenCL environment?

Problem description

Nvidia offers a sample showing how to profile the bandwidth between host and device; you can find the code here: https://developer.nvidia.com/opencl (search for "bandwidth"). The experiment is carried out on an Ubuntu 12.04 64-bit machine. I am inspecting the pinned-memory, mapped-access mode, which can be tested by invoking: ./bandwidthtest --memory=pinned --access=mapped

The core test loop for host-to-device bandwidth is at around lines 736~748. I list it here, with some comments and context code added:

    //create a buffer cmPinnedData in host
    cmPinnedData = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, memSize, NULL, &ciErrNum);

    ....(initialize cmPinnedData with some data)....

    //create a buffer in device
    cmDevData = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);

    // get pointer mapped to host buffer cmPinnedData
    h_data = (unsigned char*)clEnqueueMapBuffer(cqCommandQueue, cmPinnedData, CL_TRUE, CL_MAP_READ, 0, memSize, 0, NULL, NULL, &ciErrNum);

    // get pointer mapped to device buffer cmDevData
    void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

    // copy data from host to device by memcpy
    for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        memcpy(dm_idata, h_data, memSize);
    }
    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);

The measured host-to-device bandwidth is 6430.0 MB/s when the transfer size is 33.5 MB. When the transfer size is reduced to 1 MB with: ./bandwidthtest --memory=pinned --access=mapped --mode=range --start=1000000 --end=1000000 --increment=1000000 (MEMCOPY_ITERATIONS is changed from 100 to 10000 in case the timer is not precise enough), the reported bandwidth becomes 12540.5 MB/s.
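
For reference, a bandwidth figure like that is typically derived as total bytes moved divided by elapsed wall-clock time (a minimal sketch; the helper name and elapsedTimeInSec are my own, not taken from the sample):

    // Hypothetical helper: convert (bytes per iteration, iterations, seconds) into MB/s.
    double bandwidth_in_mbs(size_t memSize, unsigned int iterations, double elapsedTimeInSec)
    {
        // total bytes / seconds, scaled to megabytes (2^20 bytes)
        return ((double)memSize * (double)iterations) / (elapsedTimeInSec * (double)(1 << 20));
    }

If the timed loop does not actually move memSize bytes across the bus on every iteration, this formula will happily report a number above the physical limit, which is exactly the symptom below.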

We all know that the peak bandwidth of a PCI-e x16 Gen2 link is 8000 MB/s, so I suspect there is some problem with the profiling method.

Let's look at the core profiling code again:

    // get pointer mapped to device buffer cmDevData
    void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

    // copy data from host to device by memcpy
    for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        memcpy(dm_idata, h_data, memSize);
        //can we call kernel after memcpy? I don't think so.
    }
    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);

I think the problem is that memcpy cannot guarantee that the data has really been transferred to the device, because there is no explicit synchronization API inside the loop. So if we try to call a kernel after the memcpy, the kernel may or may not see valid data.

If we do the map and unmap operations inside the profiling loop, I think we can safely call a kernel after the unmap, because that operation guarantees the data is safely on the device. The new code is given here:

// copy data from host to device by memcpy
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    // get pointer mapped to device buffer cmDevData
    void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

    memcpy(dm_idata, h_data, memSize);

    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);

    //we can call kernel here safely?
}

But if we use this new profiling method, the reported bandwidth becomes very low: 915.2 MB/s at a 33.5 MB block size and 881.9 MB/s at a 1 MB block size. The overhead of the map and unmap operations does not seem as small as "zero-copy" suggests.

This map-unmap approach is even much slower than the 2909.6 MB/s at a 33.5 MB block size obtained the normal way, with clEnqueueWriteBuffer():

    for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        clEnqueueWriteBuffer(cqCommandQueue, cmDevData, CL_TRUE, 0, memSize, h_data, 0, NULL, NULL);
        clFinish(cqCommandQueue);
    }

So, my final question is: what is the correct and most efficient way to use the mapped (zero-copy) mechanism in an Nvidia OpenCL environment?

Following @DarkZeros's suggestion, I did more tests on the map-unmap method.

Method 1 is just @DarkZeros's method:

//create N buffers in device
for(int i=0; i<MEMCOPY_ITERATIONS; i++)
    cmDevData[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);

// get pointers mapped to device buffers cmDevData
void* dm_idata[MEMCOPY_ITERATIONS];
for(int i=0; i<MEMCOPY_ITERATIONS; i++)
    dm_idata[i] = clEnqueueMapBuffer(cqCommandQueue, cmDevData[i], CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

//Measure the STARTIME
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    // copy data from host to device by memcpy
    memcpy(dm_idata[i], h_data, memSize);

    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData[i], dm_idata[i], 0, NULL, NULL);
}
clFinish(cqCommandQueue);
//Measure the ENDTIME

The above method got 1900 MB/s. That is still significantly lower than the normal write-buffer method. More importantly, this method is not really close to the real host-to-device case, because the map operation sits outside the profiling interval, so we cannot run the profiling interval many times. If we want to run it many times (for example, to use the profiled block as a sub-function that transfers data), we have to map before every call of that sub-function (because the unmap happens inside it), so the map operation should be counted in the profiling interval. So I did a second test:

//create N buffers in device
for(int i=0; i<MEMCOPY_ITERATIONS; i++)
    cmDevData[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);

void* dm_idata[MEMCOPY_ITERATIONS];

//Measure the STARTIME
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    // get pointers mapped to device buffers cmDevData
    dm_idata[i] = clEnqueueMapBuffer(cqCommandQueue, cmDevData[i], CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

    // copy data from host to device by memcpy
    memcpy(dm_idata[i], h_data, memSize);

    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData[i], dm_idata[i], 0, NULL, NULL);
}
clFinish(cqCommandQueue);
//Measure the ENDTIME

This yields 980 MB/s, about the same as the earlier result. It seems that Nvidia's OpenCL implementation can hardly achieve the same data-transfer performance as CUDA.

Recommended answer

The first thing to note here is that OpenCL does not allow pinned zero-copy (it becomes available in 2.0, but is not yet ready to use). This means you will have to perform a copy to GPU memory anyway.
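
For context, this is roughly what that 2.0 route looks like with coarse-grained shared virtual memory. This is only an illustration assuming an OpenCL 2.0 platform and an existing context, queue and kernel (ckKernel and globalSize are assumed names); it is not something the Nvidia driver discussed here offers:

    // Coarse-grained SVM sketch (OpenCL 2.0+): the same allocation is visible to
    // host and device, so the data is never copied into a separate cl_mem object.
    void* svm = clSVMAlloc(cxGPUContext, CL_MEM_READ_WRITE, memSize, 0);

    // Host access to coarse-grained SVM still has to be bracketed by map/unmap.
    clEnqueueSVMMap(cqCommandQueue, CL_TRUE, CL_MAP_WRITE, svm, memSize, 0, NULL, NULL);
    memcpy(svm, h_data, memSize);
    clEnqueueSVMUnmap(cqCommandQueue, svm, 0, NULL, NULL);

    // The pointer itself is handed to the kernel.
    size_t globalSize = memSize;                      // hypothetical work size
    clSetKernelArgSVMPointer(ckKernel, 0, svm);
    clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
    clFinish(cqCommandQueue);

    clSVMFree(cxGPUContext, svm);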

There are two ways to perform the memory copy:

  1. clEnqueueWriteBuffer()/clEnqueueReadBuffer(): These perform a direct copy between a host-side pointer and an OpenCL object in the context (typically in device memory). The efficiency is high, but they may not be efficient for small numbers of bytes.

  2. clEnqueueMapBuffer()/clEnqueueUnmapMemObject(): These calls first map a device memory zone into host memory. This map generates a 1:1 copy of the memory. After the map you can work on that memory with memcpy() or other approaches, and once you finish editing it you call the unmap, which transfers the memory back to the device. Typically this option is faster, since OpenCL gives you the pointer when you map and you are likely writing into host-cached memory of the context. The counterpart is that when you call map, the memory transfer happens the other way around (GPU->host).

EDIT: In this last case, if you select the flag CL_MAP_WRITE for the mapping, it is probably NOT triggering a device-to-host copy on the map operation. The same thing happens with a read-only mapping (CL_MAP_READ), which will NOT trigger a host-to-device copy on the unmap.
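
As an illustration of those flag choices (a minimal sketch reusing the names defined above; whether the driver really skips the transfers is an implementation detail of Nvidia's runtime, not something the standard guarantees):

    // Upload path: we only intend to overwrite the buffer, so map it for writing.
    // With CL_MAP_WRITE the runtime may skip the device->host copy on the map;
    // the host->device transfer then happens on the unmap.
    void* p = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE,
                                 0, memSize, 0, NULL, NULL, &ciErrNum);
    memcpy(p, h_data, memSize);
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, p, 0, NULL, NULL);

    // Download path: we only intend to read results, so map it for reading.
    // The device->host transfer happens on the map; with CL_MAP_READ the unmap
    // should not copy anything back to the device.
    p = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_READ,
                           0, memSize, 0, NULL, NULL, &ciErrNum);
    memcpy(h_data, p, memSize);
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, p, 0, NULL, NULL);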

In your example it is clear that the Map/Unmap approach should be faster. However, if you do the memcpy() inside a loop without calling the unmap, you are effectively NOT copying anything to the device side. If you put the map/unmap inside the loop, the performance will decrease, and if the buffer size is small (1 MB) the transfer rates will be very poor. The same thing happens in the Write/Read case, though, if you perform the writes in a for loop with small sizes.

In general, you should not use 1 MB sizes, since the overhead will be very high in that case (unless you queue many write calls in non-blocking mode).
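
A sketch of that batched, non-blocking pattern for small transfers (my illustration, not code from the Nvidia sample; it reuses the names already defined above):

    // Queue many small non-blocking writes back to back; the driver can pipeline
    // them, so the enqueue overhead is hidden behind the previous transfer.
    for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        ciErrNum = clEnqueueWriteBuffer(cqCommandQueue, cmDevData, CL_FALSE /* non-blocking */,
                                        0, memSize, h_data, 0, NULL, NULL);
    }
    // Synchronize once, after everything has been queued.
    clFinish(cqCommandQueue);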

PD: My personal recommendation is to simply use the normal Write/Read, since the difference is not noticeable for most common uses, especially with overlapped I/O and kernel execution. But if you really need the performance, use Map/Unmap, or pinned memory together with Read/Write, which can give 10-30% higher transfer rates.
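
The pinned-memory-plus-Write combination mentioned there looks roughly like this (a sketch in the spirit of the Nvidia sample's pinned mode; cmPinned and h_pinned are my own names, and keeping the staging buffer mapped while it is used as the source of the write is something the sample relies on the Nvidia driver tolerating):

    // Allocate a host-pinned staging buffer and map it once to get a host pointer.
    cl_mem cmPinned = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                     memSize, NULL, &ciErrNum);
    unsigned char* h_pinned = (unsigned char*)clEnqueueMapBuffer(cqCommandQueue, cmPinned, CL_TRUE,
                                     CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

    // Fill the pinned staging area on the host, then let the driver DMA it to the device.
    memcpy(h_pinned, h_data, memSize);
    ciErrNum = clEnqueueWriteBuffer(cqCommandQueue, cmDevData, CL_FALSE,
                                    0, memSize, h_pinned, 0, NULL, NULL);
    clFinish(cqCommandQueue);

    // Release the staging buffer when done.
    clEnqueueUnmapMemObject(cqCommandQueue, cmPinned, h_pinned, 0, NULL, NULL);
    clReleaseMemObject(cmPinned);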

Regarding the behaviour you are seeing: after checking the nVIDIA code I can explain it to you. The problem you are seeing is mainly produced by the blocking and non-blocking calls, which "hide" the overhead of the OpenCL calls.

The first code (nVIDIA's):

  • Queues a BLOCKING map once.
  • Then performs many memcpys (but only the last one will go to the GPU side).
  • Then unmaps it in a non-blocking manner.
  • Reads the result without a clFinish().

This code example is WRONG! It is not really measuring the HOST-to-GPU copy speed, because the memcpy() does not ensure a copy to the GPU and because a clFinish() is missing. That's why you even see speeds over the bus limit.
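
The smallest change that at least makes the timing honest (my sketch, not the sample's code) is to force completion before the end time is taken; it still moves the data only once, so it does not fix the benchmark, but it stops the numbers from exceeding the bus limit:

    // ... memcpy loop and non-blocking unmap as in the sample ...
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);

    // Wait until the unmap, i.e. the actual host->device transfer, has completed
    // before the end time is read.
    clFinish(cqCommandQueue);
    // ... stop the timer here ...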

The second code (yours):

  • Queues a BLOCKING map many times in a loop.
  • Then performs one memcpy() for each map.
  • Then unmaps it in a non-blocking manner.
  • Reads the result without a clFinish().

Your code only lacks the clFinish(). However, since the map inside the loop is blocking, the results are almost correct. But the GPU is idle until the CPU gets around to the next iteration, so you are seeing an unrealistically low performance figure.
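
One way to separate the real transfer time from those CPU-side gaps (an illustration, not part of either code base; cdDevice stands for whichever cl_device_id the sample selected) is to enable queue profiling and read the per-command timestamps of the unmap event:

    // Create the command queue with profiling enabled.
    cl_command_queue q = clCreateCommandQueue(cxGPUContext, cdDevice,
                                              CL_QUEUE_PROFILING_ENABLE, &ciErrNum);

    cl_event ev;
    void* p = clEnqueueMapBuffer(q, cmDevData, CL_TRUE, CL_MAP_WRITE,
                                 0, memSize, 0, NULL, NULL, &ciErrNum);
    memcpy(p, h_data, memSize);
    ciErrNum = clEnqueueUnmapMemObject(q, cmDevData, p, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    // CL_PROFILING_COMMAND_START/END bracket only the execution of the unmap,
    // i.e. the actual host->device transfer, excluding any CPU idle time.
    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
    double seconds = (double)(t1 - t0) * 1e-9;   // timestamps are in nanoseconds
    clReleaseEvent(ev);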

The Write/Read code (nVIDIA's):

  • Queues a non-blocking write many times.
  • Reads the result with a clFinish().

This code is doing the copy properly, in parallel, and here you are seeing the real bandwidth.

In order to convert the map example into something comparable to the Write/Read case, you should do it like this (this is without pinned memory):

//create N buffers in device
for(int i=0; i<MEMCOPY_ITERATIONS; i++)
    cmDevData[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);

// get pointers mapped to device buffers cmDevData
void* dm_idata[MEMCOPY_ITERATIONS];
for(int i=0; i<MEMCOPY_ITERATIONS; i++)
    dm_idata[i] = clEnqueueMapBuffer(cqCommandQueue, cmDevData[i], CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

//Measure the STARTIME
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    // copy data from host to device by memcpy
    memcpy(dm_idata[i], h_data, memSize);

    //unmap device buffer.
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData[i], dm_idata[i], 0, NULL, NULL);
}
clFinish(cqCommandQueue);

//Measure the ENDTIME

You can't reuse the same buffer in the mapped case, since otherwise you would block after each iteration, and the GPU would be idle until the CPU requeues the next copy job.
