Read pixel data from default framebuffer in OpenGL: Performance of FBO vs. PBO

Question

My goal is to read the contents of the default OpenGL framebuffer and store the pixel data in a cv::Mat. Apparently there are two different ways of achieving this:

1) Synchronous: use FBO and glReadPixels

cv::Mat a = cv::Mat::zeros(cv::Size(1920, 1080), CV_8UC3);
glReadPixels(0, 0, 1920, 1080, GL_BGR, GL_UNSIGNED_BYTE, a.data);

2) Asynchronous: use PBO and glReadPixels

cv::Mat b = cv::Mat::zeros(cv::Size(1920, 1080), CV_8UC3);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo_userImage);
    glReadPixels(0, 0, 1920, 1080, GL_BGR, GL_UNSIGNED_BYTE, 0);
    unsigned char* ptr = static_cast<unsigned char*>(glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY));
    std::copy(ptr, ptr + 1920 * 1080 * 3 * sizeof(unsigned char), b.data);
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

From all the information I collected on this topic, the asynchronous version 2) should be much faster. However, comparing the elapsed time for both versions shows that the differences are often minimal, and sometimes version 1) even outperforms the PBO variant.

For performance checks, I've inserted the following code (based on this answer):

std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
....
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << std::endl;

I've also experimented with the usage hint when creating the PBO: I didn't find much of a difference between GL_DYNAMIC_COPY and GL_STREAM_READ here.

I'd welcome suggestions on how to increase the speed of this pixel read operation from the framebuffer even further.

Recommended answer

Your second version is not asynchronous at all, since you're mapping the buffer immediately after triggering the copy. The map call will then block until the contents of the buffer are available, effectively becoming synchronous.

Or, depending on the driver, it will block when you actually read from the mapped memory. In other words, the driver may implement the mapping in such a way that the first access causes a page fault and a subsequent synchronization. It doesn't really matter in your case, since the std::copy accesses that data straight away.

The correct approach is to use sync objects and fences.

Keep your PBO setup, but after issuing the glReadPixels into a PBO, insert a sync object into the stream via glFenceSync. Then, some time later, poll for that fence sync object to be complete (or just wait for it altogether) via glClientWaitSync.

If glClientWaitSync reports that the commands before the fence are complete, you can now read from the buffer without an expensive CPU/GPU sync. (If the driver is particularly stupid and didn't already move the buffer contents into mappable addresses, in spite of your usage hints on the PBO, you can use another thread to perform the map. glGetBufferSubData can therefore be cheaper, as the data doesn't need to be in a mappable range.)
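The request/poll flow described above can be sketched roughly as follows. This is only an illustration of the control flow, not the answer's own code: FencedReadback and FakeDriver are hypothetical names, and FakeDriver is a stand-in that pretends the fence signals on the third poll so the logic can run without a GL context. The comments name the real OpenGL calls that each stand-in method would wrap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the driver so the control flow runs without a GL context.
// Assumption (illustration only): the fence signals on the third poll.
struct FakeDriver {
    int pollsUntilDone = 3;
    int polls = 0;
    bool fenceInserted = false;

    // Real code: glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo); glReadPixels(..., 0);
    void readPixelsIntoPbo() {}

    // Real code: fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    void insertFence() { fenceInserted = true; polls = 0; }

    // Real code: glClientWaitSync(fence, 0, 0) returning GL_ALREADY_SIGNALED
    // or GL_CONDITION_SATISFIED; a zero timeout makes this a non-blocking poll.
    bool fenceSignaled() { return fenceInserted && ++polls >= pollsUntilDone; }

    // Real code: glMapBuffer + std::copy + glUnmapBuffer (or glGetBufferSubData).
    void copyOut(std::vector<unsigned char>& dst) { dst.assign(dst.size(), 255); }
};

template <class Driver>
class FencedReadback {
public:
    explicit FencedReadback(Driver& d) : d_(d) {}

    // Queue the copy and a fence behind it, then return without waiting.
    void request() {
        d_.readPixelsIntoPbo();
        d_.insertFence();
        pending_ = true;
    }

    // Poll once; touch the buffer only after the fence has signaled.
    bool tryFetch(std::vector<unsigned char>& dst) {
        if (!pending_ || !d_.fenceSignaled()) return false;
        d_.copyOut(dst);
        pending_ = false;  // real code would also glDeleteSync(fence) here
        return true;
    }

private:
    Driver& d_;
    bool pending_ = false;
};
```

The point of the split is that request() is called right after rendering, while tryFetch() is called later (for example once per frame), so the CPU does other work instead of stalling inside the map call.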

If you need to do this on a frame-by-frame basis, you'll notice that it's very likely that you'll need more than one PBO, that is, have a small pool of them. This is because at the next frame the readback of the previous frame's data is not complete yet and the corresponding fence not signalled. (Yes, GPUs are massively pipelined these days, and they will be some frames behind your submission queue).
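A small pool like this could be scheduled as a ring, sketched below under stated assumptions. ReadbackPool and FakeGpu are hypothetical names; FakeGpu simulates a GPU whose readbacks complete a fixed number of frames after they were issued, which is a simplification of real driver behaviour. The comments indicate which OpenGL calls the stand-in methods would correspond to.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the GPU so the scheduling logic runs without a GL context.
// Assumption: a readback becomes available `latency` frames after issue.
struct FakeGpu {
    int latency;
    long now = 0;                // current frame number, set by the caller
    std::vector<long> issuedAt;  // per-slot frame number of the request

    FakeGpu(std::size_t slots, int lat) : latency(lat), issuedAt(slots, -1) {}

    // Real code: bind PBO `slot`, glReadPixels(..., 0), then glFenceSync(...).
    void issueRead(std::size_t slot) { issuedAt[slot] = now; }

    // Real code: glClientWaitSync(fence[slot], 0, 0) reports the fence signaled.
    bool ready(std::size_t slot) const { return now - issuedAt[slot] >= latency; }

    // Real code: map PBO `slot` (or glGetBufferSubData) and copy the pixels out.
    // Here the "pixels" are just the frame number that was captured.
    long fetch(std::size_t slot) const { return issuedAt[slot]; }
};

template <class Gpu>
class ReadbackPool {
public:
    ReadbackPool(Gpu& gpu, std::size_t slots) : gpu_(gpu), slots_(slots) {}

    // Call once per frame after drawing. Returns the frame number whose pixels
    // were retrieved in this call, or -1 if the oldest readback is still pending.
    long onFrame() {
        long done = -1;
        if (count_ > 0 && gpu_.ready(tail_)) {  // oldest request finished?
            done = gpu_.fetch(tail_);
            tail_ = (tail_ + 1) % slots_;
            --count_;
        }
        if (count_ < slots_) {                  // a free PBO exists
            gpu_.issueRead(head_);
            head_ = (head_ + 1) % slots_;
            ++count_;
        } else {
            ++dropped_;                         // pool too small for this latency
        }
        return done;
    }

    std::size_t dropped() const { return dropped_; }

private:
    Gpu& gpu_;
    std::size_t slots_;
    std::size_t head_ = 0;     // next slot to issue into
    std::size_t tail_ = 0;     // oldest slot still in flight
    std::size_t count_ = 0;    // number of slots in flight
    std::size_t dropped_ = 0;  // frames skipped because no slot was free
};
```

In this sketch a frame whose readback cannot be issued is simply skipped and counted; blocking on the oldest fence instead would be the alternative if every frame must be captured.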
