How can I improve the performance of my custom OpenGL ES 2.0 depth texture generation?


Problem Description



I have an open source iOS application that uses custom OpenGL ES 2.0 shaders to display 3-D representations of molecular structures. It does this by using procedurally generated sphere and cylinder impostors drawn over rectangles, instead of these same shapes built using lots of vertices. The downside to this approach is that the depth values for each fragment of these impostor objects need to be calculated in a fragment shader, so they can be used when objects overlap.
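To give a feel for what an impostor submission might look like, here is a minimal sketch; the struct name and layout are my assumptions, not the project's actual data structures. Each sphere is sent as a single screen-facing quad whose corners carry a coordinate in [-1, 1] x [-1, 1], which the fragment shader uses to decide whether a pixel lies inside the sphere and at what depth:

// Hypothetical per-vertex layout for a sphere impostor quad.
typedef struct {
    float center[3];          // sphere center in model space, identical for all four corners
    float radius;             // sphere radius
    float impostorCorner[2];  // (-1,-1), (1,-1), (-1,1) or (1,1); becomes impostorSpaceCoordinate
} SphereImpostorVertex;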

Unfortunately, OpenGL ES 2.0 does not let you write to gl_FragDepth, so I've needed to output these values to a custom depth texture. I do a pass over my scene using a framebuffer object (FBO), only rendering out a color that corresponds to a depth value, with the results being stored into a texture. This texture is then loaded into the second half of my rendering process, where the actual screen image is generated. If a fragment at that stage is at the depth level stored in the depth texture for that point on the screen, it is displayed. If not, it is tossed. More about the process, including diagrams, can be found in my post here.
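For reference, a bare-bones version of that depth-pass FBO setup could look like the following; the texture format, size, and variable names are assumptions on my part, not the project's exact code:

// Assumes <OpenGLES/ES2/gl.h>. Create a color texture that will hold the encoded depth values.
GLuint depthPassTexture, depthPassFramebuffer;
GLsizei width = 768, height = 1024;   // placeholder backing size

glGenTextures(1, &depthPassTexture);
glBindTexture(GL_TEXTURE_2D, depthPassTexture);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);

// Attach it to an FBO so the depth pass renders straight into the texture.
glGenFramebuffers(1, &depthPassFramebuffer);
glBindFramebuffer(GL_FRAMEBUFFER, depthPassFramebuffer);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, depthPassTexture, 0);

if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
{
    // handle the incomplete framebuffer
}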

The generation of this depth texture is a bottleneck in my rendering process and I'm looking for a way to make it faster. It seems slower than it should be, but I can't figure out why. In order to achieve the proper generation of this depth texture, GL_DEPTH_TEST is disabled, GL_BLEND is enabled with glBlendFunc(GL_ONE, GL_ONE), and glBlendEquation() is set to GL_MIN_EXT. I know that a scene output in this manner isn't the fastest on a tile-based deferred renderer like the PowerVR series in iOS devices, but I can't think of a better way to do this.
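The state described above boils down to something like this (a sketch; GL_MIN_EXT comes from the EXT_blend_minmax extension on these devices):

glDisable(GL_DEPTH_TEST);
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE);
glBlendEquation(GL_MIN_EXT);   // each channel keeps the minimum, so the nearest encoded depth wins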

My depth fragment shader for spheres (the most common display element) looks to be at the heart of this bottleneck (Renderer Utilization in Instruments is pegged at 99%, indicating that I'm limited by fragment processing). It currently looks like the following:

precision mediump float;

varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;

const vec3 stepValues = vec3(2.0, 1.0, 0.0);
const float scaleDownFactor = 1.0 / 255.0;

void main()
{
    float distanceFromCenter = length(impostorSpaceCoordinate);
    if (distanceFromCenter > 1.0)
    {
        gl_FragColor = vec4(1.0);
    }
    else
    {
        float calculatedDepth = sqrt(1.0 - distanceFromCenter * distanceFromCenter);
        mediump float currentDepthValue = normalizedDepth - adjustedSphereRadius * calculatedDepth;

        // Inlined color encoding for the depth values
        float ceiledValue = ceil(currentDepthValue * 765.0);

        vec3 intDepthValue = (vec3(ceiledValue) * scaleDownFactor) - stepValues;

        gl_FragColor = vec4(intDepthValue, 1.0);
    }
}

On an iPad 1, this takes 35 - 68 ms to render a frame of a DNA spacefilling model using a passthrough shader for display (18 to 35 ms on iPhone 4). According to the PowerVR PVRUniSCo compiler (part of their SDK), this shader uses 11 GPU cycles at best, 16 cycles at worst. I'm aware that you're advised not to use branching in a shader, but in this case that led to better performance than otherwise.
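To spell out the encoding above: with x = ceil(currentDepthValue * 765.0) / 255.0, the shader effectively writes clamp(x - 2.0), clamp(x - 1.0), and clamp(x) to the three channels. Each channel is non-decreasing in depth, so per-channel GL_MIN_EXT blending keeps the channels of the nearest fragment, and the depth can later be recovered as roughly the average of the three channels. A sketch of that readback (my reading of the math, not code from the project):

// r, g, b are the blended channel values in [0.0, 1.0].
static float decodeEncodedDepth(float r, float g, float b)
{
    // clamp(x - 2) + clamp(x - 1) + clamp(x) == x for x in [0, 3],
    // so the sum divided by 3 recovers ceil(depth * 765) / 765 ~= depth.
    return (r + g + b) / 3.0f;
}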

When I simplify it to

precision mediump float;

varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;

void main()
{
    gl_FragColor = vec4(adjustedSphereRadius * normalizedDepth * (impostorSpaceCoordinate + 1.0) / 2.0, normalizedDepth, 1.0);
}

it takes 18 - 35 ms on iPad 1, but only 1.7 - 2.4 ms on iPhone 4. The estimated GPU cycle count for this shader is 8 cycles, so the render time clearly doesn't scale linearly with the estimated cycle count.

Finally, if I just output a constant color:

precision mediump float;

void main()
{
    gl_FragColor = vec4(0.5, 0.5, 0.5, 1.0);
}

the rendering time drops to 1.1 - 2.3 ms on iPad 1 (1.3 ms on iPhone 4).

The nonlinear scaling in rendering time and sudden change between iPad and iPhone 4 for the second shader makes me think that there's something I'm missing here. A full source project containing these three shader variants (look in the SphereDepth.fsh file and comment out the appropriate sections) and a test model can be downloaded from here, if you wish to try this out yourself.

If you've read this far, my question is: based on this profiling information, how can I improve the rendering performance of my custom depth shader on iOS devices?

Solution

Based on the recommendations by Tommy, Pivot, and rotoglup, I've implemented some optimizations which have led to a doubling of the rendering speed for both the depth texture generation and the overall rendering pipeline in the application.

First, I re-enabled the precalculated sphere depth and lighting texture that I'd used before with little effect, only now I use proper lowp precision values when handling the colors and other values from that texture. This combination, along with proper mipmapping for the texture, seems to yield a ~10% performance boost.
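The sampling state for that lookup texture is roughly the following (the texture name and exact filter modes are my assumptions):

glBindTexture(GL_TEXTURE_2D, sphereDepthLookupTexture);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glGenerateMipmap(GL_TEXTURE_2D);   // build the mip chain once, after uploading the lookup data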

More importantly, I now do a pass before rendering both my depth texture and the final raytraced impostors where I lay down some opaque geometry to block pixels that would never be rendered. To do this, I enable depth testing and then draw out the squares that make up the objects in my scene, shrunken by sqrt(2) / 2, with a simple opaque shader. This creates inset squares covering the area known to be opaque within each represented sphere.

I then disable depth writes using glDepthMask(GL_FALSE) and render the square sphere impostors at a location one radius closer to the user. This allows the tile-based deferred rendering hardware in iOS devices to efficiently strip out fragments that would never appear onscreen under any conditions, yet still gives smooth intersections between the visible sphere impostors based on per-pixel depth values. This is depicted in my crude illustration below:

In this example, the opaque blocking squares for the top two impostors do not prevent any of the fragments from those visible objects from being rendered, yet they block a chunk of the fragments from the lowest impostor. The frontmost impostors can then use per-pixel tests to generate a smooth intersection, while many of the pixels from the rear impostor don't waste GPU cycles by being rendered.

I hadn't thought to disable depth writes, yet leave on depth testing when doing the last rendering stage. This is the key to preventing the impostors from simply stacking on one another, yet still using some of the hardware optimizations within the PowerVR GPUs.
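In rough outline, the two passes described above look like this (program names and draw calls are placeholders, not the project's actual code):

// Pass 1: opaque inset squares prime the depth buffer.
glEnable(GL_DEPTH_TEST);
glDepthMask(GL_TRUE);
glUseProgram(opaqueBlockingProgram);   // trivial shader writing a constant color
drawInsetBlockingSquares();            // each square scaled by sqrt(2) / 2 to stay inside its sphere

// Pass 2: the real impostors, depth-tested against those blockers but not writing depth,
// and offset toward the viewer by one radius so they aren't rejected by their own blockers.
glDepthMask(GL_FALSE);
glUseProgram(sphereDepthProgram);      // the per-pixel depth shader shown earlier
drawFullImpostorSquares();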

In my benchmarks, rendering the test model I used above yields times of 18 - 35 ms per frame, as compared to the 35 - 68 ms I was getting previously, a near doubling in rendering speed. Applying this same opaque geometry pre-rendering to the raytracing pass yields a doubling in overall rendering performance.

Oddly, when I tried to refine this further by using inset and circumscribed octagons, which should cover ~17% fewer pixels when drawn, and be more efficient with blocking fragments, performance was actually worse than when using simple squares for this. Tiler utilization was still less than 60% in the worst case, so maybe the larger geometry was resulting in more cache misses.

EDIT (5/31/2011):

Based on Pivot's suggestion, I created inscribed and circumscribed octagons to use instead of my rectangles, only I followed the recommendations here for optimizing triangles for rasterization. In previous testing, octagons yielded worse performance than squares, despite removing many unnecessary fragments and letting you block covered fragments more efficiently. By adjusting the triangle drawing as follows:

I was able to reduce overall rendering time by an average of 14% on top of the above-described optimizations by switching to octagons from squares. The depth texture is now generated in 19 ms, with occasional dips to 2 ms and spikes to 35 ms.
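For the octagon geometry itself, something along the following lines would place the vertices; this is only the vertex placement, and it does not reproduce the specific triangle ordering from the rasterization recommendations linked above. A regular octagon circumscribed about the unit impostor circle needs its vertices at radius 1 / cos(pi / 8), roughly 1.082, while the inscribed blocking octagon keeps its vertices on the circle:

#include <math.h>

// Generate the eight 2-D corners of a regular octagon; names and layout are hypothetical.
static void octagonCorners(float radius, float corners[8][2])
{
    const float kPi = 3.14159265f;
    for (int i = 0; i < 8; i++)
    {
        float angle = kPi / 8.0f + (float)i * kPi / 4.0f;   // 22.5 deg + k * 45 deg
        corners[i][0] = radius * cosf(angle);
        corners[i][1] = radius * sinf(angle);
    }
}

// Circumscribed impostor octagon: octagonCorners(1.0f / cosf(3.14159265f / 8.0f), corners);
// Inscribed blocking octagon:     octagonCorners(1.0f, corners);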

EDIT 2 (5/31/2011):

I've revisited Tommy's idea of using the step function, now that I have fewer fragments to discard due to the octagons. This, combined with a depth lookup texture for the sphere, now leads to a 2 ms average rendering time on the iPad 1 for the depth texture generation for my test model. I consider that to be about as good as I could hope for in this rendering case, and a giant improvement from where I started. For posterity, here is the depth shader I'm now using:

precision mediump float;

varying mediump vec2 impostorSpaceCoordinate;
varying mediump float normalizedDepth;
varying mediump float adjustedSphereRadius;
varying mediump vec2 depthLookupCoordinate;

uniform lowp sampler2D sphereDepthMap;

const lowp vec3 stepValues = vec3(2.0, 1.0, 0.0);

void main()
{
    lowp vec2 precalculatedDepthAndAlpha = texture2D(sphereDepthMap, depthLookupCoordinate).ra;

    float inCircleMultiplier = step(0.5, precalculatedDepthAndAlpha.g);

    float currentDepthValue = normalizedDepth + adjustedSphereRadius - adjustedSphereRadius * precalculatedDepthAndAlpha.r;

    // Inlined color encoding for the depth values
    currentDepthValue = currentDepthValue * 3.0;

    lowp vec3 intDepthValue = vec3(currentDepthValue) - stepValues;

    gl_FragColor = vec4(1.0 - inCircleMultiplier) + vec4(intDepthValue, inCircleMultiplier);
}

I've updated the testing sample here, if you wish to see this new approach in action as compared to what I was doing initially.

I'm still open to other suggestions, but this is a huge step forward for this application.
