Why is replacing these matrix transformations with a single bufferData call so much slower?


Question

I'm trying to optimize my shader that draws sprites and I originally had something like this:

// this matrix will convert from pixels to clip space
var matrix = m3.projection(this.camera.viewportWidth / this.camera.scale, this.camera.viewportHeight / this.camera.scale);

// this matrix will translate our quad to dstX, dstY
matrix = m3.translate(matrix, dstX, dstY);

// this matrix will scale our 1 unit quad
// from 1 unit to dstWidth, dstHeight units
matrix = m3.scale(matrix, dstWidth, dstHeight);

gl.uniformMatrix3fv(attribs.matrixLocation, false, matrix);

The above code is inspired by this tutorial: https://webglfundamentals.org/webgl/lessons/webgl-2d-drawimage.html

Which worked, but I already have my camera matrix transformation saved, so I wanted to avoid having to do all of those matrix transformations each frame. Each of those m3.whatever calls allocates a new array, so I thought to replace it with the following:

gl.bindBuffer(gl.ARRAY_BUFFER, attribs.positionBuffer);

attribs.positionsQuad[0] = dstX;
attribs.positionsQuad[1] = dstY + dstHeight;
attribs.positionsQuad[2] = dstX;
attribs.positionsQuad[3] = dstY;
attribs.positionsQuad[4] = dstX + dstWidth;
attribs.positionsQuad[5] = dstY + dstHeight;

attribs.positionsQuad[6] = dstX + dstWidth;
attribs.positionsQuad[7] = dstY + dstHeight;
attribs.positionsQuad[8] = dstX;
attribs.positionsQuad[9] = dstY;
attribs.positionsQuad[10] = dstX + dstWidth;
attribs.positionsQuad[11] = dstY;

gl.bufferData(gl.ARRAY_BUFFER, attribs.positionsQuad, gl.DYNAMIC_DRAW);
gl.vertexAttribPointer(attribs.positionLocation, 2, gl.FLOAT, false, 0, 0);

gl.uniformMatrix3fv(attribs.matrixLocation, false, camera.ClipTransform);

Which also works, but now my frame rate is very spiky. Does anyone know why this is? I tried profiling it, and it does say that my image drawing shader is now slower, but I'm not sure how this could be. I replaced a bunch of matrix allocations and transformations with writes to a single pre-allocated array followed by one transfer, and now it's much slower?

It seems that a lot of the frame rate jumps may be due to the garbage collector running, but even this doesn't make sense to me. With the initial solution, there should have been so much more garbage, considering I'm allocating and throwing away a ton of arrays each frame with all those matrix transformations. And now I'm not allocating at all, so why would GC usage spike now?

Is there a better way to accomplish this? I've uploaded my entire shader here for reference: https://pastebin.com/tdCYpDqv

Recommended answer

For most graphics API commands, the command is encoded into a command buffer, and at some later point (asynchronously) the graphics driver synchronizes those buffers to the GPU. For a command buffer to be predictable, all the data a command refers to needs to be copied into the buffer.

Now, one problem with your code is that you set the data and immediately ask the GPU to draw from it, which requires a hard sync of the complete buffer. The driver expects uniforms to need syncing, but not necessarily array buffers. The usage hints (DYNAMIC_DRAW, STREAM_DRAW and STATIC_DRAW) don't really do much about that (in most cases STATIC_DRAW is actually faster, even for dynamic data).

When these hard syncs happen you're almost always stalling the pipeline, meaning the GPU needs to wait for all the data to be transferred before it can continue doing whatever it was doing. You can avoid this by utilizing double or even triple buffering (write data for the next frame but render current one etc.).
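
As a rough illustration of the double or triple buffering idea, the sketch below rotates through a small ring of vertex buffers so that an upload never targets the buffer used by the most recent draw. The helper name createBufferRing and its parameters are hypothetical and not part of the original code. Whether this removes the stall depends on the driver, so treat it as something to measure rather than a guaranteed fix.

```javascript
// Hypothetical helper (not from the original code): a ring of `count`
// vertex buffers, each with `byteLength` bytes of storage allocated once.
function createBufferRing(gl, count, byteLength) {
  const buffers = [];
  for (let i = 0; i < count; i++) {
    const buffer = gl.createBuffer();
    gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
    // Allocate storage up front; later uploads reuse it via bufferSubData.
    gl.bufferData(gl.ARRAY_BUFFER, byteLength, gl.DYNAMIC_DRAW);
    buffers.push(buffer);
  }

  let index = 0;
  return {
    // Advance to the next buffer in the ring, bind it, copy `data` into it,
    // and return it. The buffer stays bound to ARRAY_BUFFER afterwards.
    upload(data) {
      index = (index + 1) % count;
      const buffer = buffers[index];
      gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
      gl.bufferSubData(gl.ARRAY_BUFFER, 0, data);
      return buffer;
    },
  };
}
```

With the code from the question, this would be created once, for example as createBufferRing(gl, 3, attribs.positionsQuad.byteLength), and ring.upload(attribs.positionsQuad) would replace the bindBuffer and bufferData calls. gl.vertexAttribPointer still has to be called after each upload, because it captures whichever buffer is currently bound.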

With all this being said, trying to optimize the drawing of 6 quads is very problematic, because in this context the differences are too small to measure reliably. Changing one thing for another might change the frame time, but it says nothing about scalability: you are really just measuring the (often static) overhead rather than the actual performance.
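
If the per-call allocations in the original matrix path are still the concern, one option is to keep the per-sprite data in the uniform (the path the driver expects to sync) and write the combined matrix straight into a preallocated array. The sketch below assumes that m3.projection, m3.translate and m3.scale follow the conventions of the linked webglfundamentals tutorial (column-major 3x3 matrices, pixel coordinates with a top-left origin, Y flipped into clip space). The names spriteMatrix and writeSpriteMatrix are hypothetical.

```javascript
// Preallocated once and reused for every sprite (hypothetical name).
const spriteMatrix = new Float32Array(9);

// Writes projection(viewWidth, viewHeight) * translation(dstX, dstY) *
// scale(dstWidth, dstHeight) into `dst` as a column-major 3x3 matrix,
// without allocating. Assumes the tutorial's m3 conventions.
function writeSpriteMatrix(dst, viewWidth, viewHeight, dstX, dstY, dstWidth, dstHeight) {
  const sx = 2 / viewWidth;   // pixels to clip space on X
  const sy = -2 / viewHeight; // pixels to clip space on Y, flipped
  // First column: X axis of the unit quad scaled to dstWidth.
  dst[0] = sx * dstWidth;
  dst[1] = 0;
  dst[2] = 0;
  // Second column: Y axis of the unit quad scaled to dstHeight.
  dst[3] = 0;
  dst[4] = sy * dstHeight;
  dst[5] = 0;
  // Third column: translation to dstX, dstY, shifted so pixel (0, 0)
  // lands on the top-left corner of clip space.
  dst[6] = sx * dstX - 1;
  dst[7] = sy * dstY + 1;
  dst[8] = 1;
  return dst;
}
```

In the original snippet, the three m3 calls would then collapse into a single call such as writeSpriteMatrix(spriteMatrix, this.camera.viewportWidth / this.camera.scale, this.camera.viewportHeight / this.camera.scale, dstX, dstY, dstWidth, dstHeight), with the result passed to gl.uniformMatrix3fv as before.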
