When are VBOs faster than "simple" OpenGL primitives (glBegin())?

Question

After many years of hearing about Vertex Buffer Objects (VBOs), I finally decided to experiment with them (my stuff isn't normally performance-critical, obviously...)

I'll describe my experiment below, but to make a long story short, I'm seeing indistinguishable performance between "simple" direct mode (glBegin()/glEnd()), vertex array (CPU side) and VBO (GPU side) rendering modes. I'm trying to understand why this is, and under what conditions I can expect to see VBOs significantly outshine their primitive (pun intended) ancestors.

For the experiment, I generated a (static) 3D Gaussian cloud of a large number of points. Each point has vertex and color information associated with it. Then I rotated the camera around the cloud in successive frames in a sort of "orbiting" behavior. Again, the points are static; only the eye moves (via gluLookAt()). The data are generated once prior to any rendering and stored in two arrays for use in the rendering loop.
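The setup described above could look roughly like the following. This is only a sketch of the data-generation step, not the poster's code: the `PointCloud` struct, the `makeGaussianCloud` name, and the `sigma` and `seed` parameters are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical container: two flat arrays, three floats per point, laid out
// the way glVertex3fv()/glColor3fv() or glDrawArrays() would read them.
struct PointCloud {
    std::vector<float> positions; // x, y, z per point
    std::vector<float> colors;    // r, g, b per point
};

// Build the static cloud once, before any rendering.
// sigma and seed are illustrative defaults, not values from the post.
PointCloud makeGaussianCloud(std::size_t count, float sigma = 1.0f, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::normal_distribution<float> coordinate(0.0f, sigma);
    std::uniform_real_distribution<float> channel(0.0f, 1.0f);

    PointCloud cloud;
    cloud.positions.reserve(count * 3);
    cloud.colors.reserve(count * 3);
    for (std::size_t i = 0; i < count * 3; ++i) {
        cloud.positions.push_back(coordinate(rng)); // Gaussian-distributed coordinate
        cloud.colors.push_back(channel(rng));       // random color channel
    }
    return cloud;
}
```

Because the arrays never change after this call, the same two buffers can feed all three rendering paths, which keeps the comparison limited to how the data is submitted.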

For direct rendering, the entire data set is rendered in a single glBegin()/glEnd() block with a loop containing a single call each to glColor3fv() and glVertex3fv().

For vertex array and VBO rendering, the entire data set is rendered with a single glDrawArrays() call.

Then, I simply run it for a minute or so in a tight loop and measure average FPS with the high performance timer.
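A measurement loop of this kind might be sketched as follows. The post does not say which timer API was used, so `std::chrono::steady_clock` stands in for it here, and `measureAverageFps` and `renderFrame` are hypothetical names.

```cpp
#include <chrono>
#include <functional>

// Call renderFrame in a tight loop for roughly `duration` and return the
// average frames per second over the whole run.
double measureAverageFps(const std::function<void()>& renderFrame,
                         std::chrono::duration<double> duration) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    long long frames = 0;
    std::chrono::duration<double> elapsed{0.0};
    do {
        renderFrame(); // in the real test: draw the cloud, then swap buffers
        ++frames;
        elapsed = Clock::now() - start;
    } while (elapsed < duration);
    return static_cast<double>(frames) / elapsed.count();
}
```

One thing to keep in mind with this style of measurement is that OpenGL calls can return before the GPU has finished the work, so the loop mostly measures how fast frames can be submitted and presented over a long run rather than the cost of any single call.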

As mentioned above, performance was indistinguishable on both my desktop machine (XP x64, 8GB RAM, 512 MB Quadro 1700) and my laptop (XP32, 4GB RAM, 256 MB Quadro NVS 110). It did scale as expected with the number of points, however. Obviously, I also disabled vsync.

Specific results from laptop runs (rendering w/GL_POINTS):

glBegin()/glEnd():

  • 1K pts --> 603 FPS
  • 10K pts --> 401 FPS
  • 100K pts --> 97 FPS
  • 1M pts --> 14 FPS

Vertex Arrays (CPU side):

  • 1K pts --> 603 FPS
  • 10K pts --> 402 FPS
  • 100K pts --> 97 FPS
  • 1M pts --> 14 FPS

Vertex Buffer Objects (GPU side):

  • 1K pts --> 604 FPS
  • 10K pts --> 399 FPS
  • 100K pts --> 95 FPS
  • 1M pts --> 14 FPS

I rendered the same data with GL_TRIANGLE_STRIP and got similarly indistinguishable results (though slower, as expected, due to the extra rasterization). I can post those numbers too if anybody wants them.
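One way to read the GL_POINTS figures above is to convert FPS into frame time and points per second. The helper below is only back-of-the-envelope arithmetic on the posted numbers; the `Sample` struct and function names are made up for illustration, and the FPS values are rounded, so the results are approximate.

```cpp
#include <cstdio>

// One measurement: how many points were drawn and the average FPS reported.
struct Sample {
    double points;
    double fps;
};

double frameTimeMs(const Sample& s) { return 1000.0 / s.fps; }
double pointsPerSecond(const Sample& s) { return s.points * s.fps; }

// Extra time per additional point between two runs, in nanoseconds.
double marginalNsPerPoint(const Sample& small, const Sample& large) {
    const double extraMs = frameTimeMs(large) - frameTimeMs(small);
    return extraMs * 1.0e6 / (large.points - small.points);
}

// Print the derived figures for the glBegin()/glEnd() laptop runs.
void printImmediateModeSummary() {
    const Sample runs[] = {{1.0e3, 603.0}, {1.0e4, 401.0}, {1.0e5, 97.0}, {1.0e6, 14.0}};
    for (const Sample& run : runs) {
        std::printf("%.0f pts: %.2f ms/frame, %.0f pts/s\n",
                    run.points, frameTimeMs(run), pointsPerSecond(run));
    }
}
```

For the glBegin()/glEnd() runs this works out to roughly 0.6, 4, 9.7 and 14 million points per second. Throughput keeps climbing with the point count, which suggests a fixed per-frame cost dominates the small runs, and since the other two modes trace nearly the same curve, whatever limits the large runs does not seem to depend on how the vertices are submitted.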

  • What gives?
  • What do I have to do to realize the promised performance gain from VBOs?
  • What am I missing?

Recommended answer

There are a lot of factors in optimizing 3D rendering. Usually there are 4 bottlenecks:

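  • CPU (creating vertices, API calls, everything else)
  • Bus (CPU<->GPU transfer)
  • Vertex (vertex shader or fixed-function pipeline execution)
  • Pixel (fill, fragment shader execution, and ROPs)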

Your test is giving skewed results because you have plenty of spare CPU (and bus) capacity while vertex or pixel throughput is maxed out. VBOs are used to lower CPU load (fewer API calls, and DMA transfers that run in parallel with the CPU). Since you are not CPU bound, they don't give you any gain. This is optimization 101. In a game, for example, CPU time becomes precious because it is needed for other things like AI and physics, not just for issuing tons of API calls. It is easy to see that writing vertex data (3 floats, for example) directly to a memory pointer is much faster than calling a function that writes 3 floats to memory; at the very least you save the cycles for the call.
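The last point, one call per vertex versus one write of the whole block, can be illustrated without any OpenGL at all. In the sketch below, `CommandBuffer` and `emitVertex3fv` are made-up stand-ins for whatever a driver does internally; they are not OpenGL types or functions, and a real driver does considerably more work per call than this.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a driver-side buffer that collects vertex data.
struct CommandBuffer {
    std::vector<float> data;
};

// Immediate-mode style: one call per vertex, each copying 3 floats.
// Plays the role of glVertex3fv() in this sketch; it is not the real function.
void emitVertex3fv(CommandBuffer& buffer, const float* v) {
    buffer.data.insert(buffer.data.end(), v, v + 3);
}

// `count` calls for `count` vertices.
void submitPerVertex(CommandBuffer& buffer, const float* vertices, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        emitVertex3fv(buffer, vertices + 3 * i);
    }
}

// Array/VBO style: a single call hands over the whole block at once.
void submitBulk(CommandBuffer& buffer, const float* vertices, std::size_t count) {
    buffer.data.assign(vertices, vertices + 3 * count);
}
```

Both functions leave the same bytes in the buffer; the difference is that the first pays call overhead once per vertex and the second pays it once per draw. That saving only shows up in the frame rate when the CPU is the bottleneck, which, per the answer, is not the case in the test above.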
