Optimizing performance of a heavy fragment shader


Problem description

I need help optimizing the following set of shaders:

Vertex:

    precision mediump float;

uniform vec2 rubyTextureSize;

attribute vec4 vPosition;
attribute vec2 a_TexCoordinate;

varying vec2 tc;

void main() {
    gl_Position = vPosition;

    tc = a_TexCoordinate;
}

Fragment:

precision mediump float;

/*
 Uniforms
 - rubyTexture: texture sampler
 - rubyTextureSize: size of the texture before rendering
 */

uniform sampler2D rubyTexture;
uniform vec2 rubyTextureSize;
uniform vec2 rubyTextureFract;

/*
 Varying attributes
 - tc: coordinate of the texel being processed
 - xyp_[]_[]_[]: a packed coordinate for 3 areas within the texture
 */

varying vec2 tc;

/*
 Constants
 */
/*
 Inequation coefficients for interpolation
 Equations are in the form: Ay + Bx = C
 45, 30, and 60 denote the angle from x of the line that each coefficient set builds
 */
const vec4 Ai = vec4(1.0, -1.0, -1.0, 1.0);
const vec4 B45 = vec4(1.0, 1.0, -1.0, -1.0);
const vec4 C45 = vec4(1.5, 0.5, -0.5, 0.5);
const vec4 B30 = vec4(0.5, 2.0, -0.5, -2.0);
const vec4 C30 = vec4(1.0, 1.0, -0.5, 0.0);
const vec4 B60 = vec4(2.0, 0.5, -2.0, -0.5);
const vec4 C60 = vec4(2.0, 0.0, -1.0, 0.5);

const vec4 M45 = vec4(0.4, 0.4, 0.4, 0.4);
const vec4 M30 = vec4(0.2, 0.4, 0.2, 0.4);
const vec4 M60 = M30.yxwz;
const vec4 Mshift = vec4(0.2);

// Coefficient for weighted edge detection
const float coef = 2.0;
// Threshold for if luminance values are "equal"
const vec4 threshold = vec4(0.32);

// Conversion from RGB to Luminance (from GIMP)
const vec3 lum = vec3(0.21, 0.72, 0.07);

// Performs same logic operation as && for vectors
bvec4 _and_(bvec4 A, bvec4 B) {
    return bvec4(A.x && B.x, A.y && B.y, A.z && B.z, A.w && B.w);
}

// Performs same logic operation as || for vectors
bvec4 _or_(bvec4 A, bvec4 B) {
    return bvec4(A.x || B.x, A.y || B.y, A.z || B.z, A.w || B.w);
}

// Converts 4 3-color vectors into 1 4-value luminance vector
vec4 lum_to(vec3 v0, vec3 v1, vec3 v2, vec3 v3) {
    //    return vec4(dot(lum, v0), dot(lum, v1), dot(lum, v2), dot(lum, v3));

    return mat4(v0.x, v1.x, v2.x, v3.x, v0.y, v1.y, v2.y, v3.y, v0.z, v1.z,
            v2.z, v3.z, 0.0, 0.0, 0.0, 0.0) * vec4(lum, 0.0);
}

// Gets the difference between 2 4-value luminance vectors
vec4 lum_df(vec4 A, vec4 B) {
    return abs(A - B);
}

// Determines if 2 4-value luminance vectors are "equal" based on threshold
bvec4 lum_eq(vec4 A, vec4 B) {
    return lessThan(lum_df(A, B), threshold);
}

vec4 lum_wd(vec4 a, vec4 b, vec4 c, vec4 d, vec4 e, vec4 f, vec4 g, vec4 h) {
    return lum_df(a, b) + lum_df(a, c) + lum_df(d, e) + lum_df(d, f)
            + 4.0 * lum_df(g, h);
}

// Gets the difference between 2 3-value rgb colors
float c_df(vec3 c1, vec3 c2) {
    vec3 df = abs(c1 - c2);
    return df.r + df.g + df.b;
}

void main() {

    /*
     Mask for the algorithm
     +-----+-----+-----+-----+-----+
     |     |  1  |  2  |  3  |     |
     +-----+-----+-----+-----+-----+
     |  5  |  6  |  7  |  8  |  9  |
     +-----+-----+-----+-----+-----+
     | 10  | 11  | 12  | 13  | 14  |
     +-----+-----+-----+-----+-----+
     | 15  | 16  | 17  | 18  | 19  |
     +-----+-----+-----+-----+-----+
     |     | 21  | 22  | 23  |     |
     +-----+-----+-----+-----+-----+
     */

    float x = rubyTextureFract.x;
    float y = rubyTextureFract.y;

    vec4 xyp_1_2_3 = tc.xxxy + vec4(-x, 0.0, x, -2.0 * y);
    vec4 xyp_6_7_8 = tc.xxxy + vec4(-x, 0.0, x, -y);
    vec4 xyp_11_12_13 = tc.xxxy + vec4(-x, 0.0, x, 0.0);
    vec4 xyp_16_17_18 = tc.xxxy + vec4(-x, 0.0, x, y);
    vec4 xyp_21_22_23 = tc.xxxy + vec4(-x, 0.0, x, 2.0 * y);
    vec4 xyp_5_10_15 = tc.xyyy + vec4(-2.0 * x, -y, 0.0, y);
    vec4 xyp_9_14_9 = tc.xyyy + vec4(2.0 * x, -y, 0.0, y);

    // Get mask values by performing texture lookup with the uniform sampler
    vec3 P1 = texture2D(rubyTexture, xyp_1_2_3.xw).rgb;
    vec3 P2 = texture2D(rubyTexture, xyp_1_2_3.yw).rgb;
    vec3 P3 = texture2D(rubyTexture, xyp_1_2_3.zw).rgb;

    vec3 P6 = texture2D(rubyTexture, xyp_6_7_8.xw).rgb;
    vec3 P7 = texture2D(rubyTexture, xyp_6_7_8.yw).rgb;
    vec3 P8 = texture2D(rubyTexture, xyp_6_7_8.zw).rgb;

    vec3 P11 = texture2D(rubyTexture, xyp_11_12_13.xw).rgb;
    vec3 P12 = texture2D(rubyTexture, xyp_11_12_13.yw).rgb;
    vec3 P13 = texture2D(rubyTexture, xyp_11_12_13.zw).rgb;

    vec3 P16 = texture2D(rubyTexture, xyp_16_17_18.xw).rgb;
    vec3 P17 = texture2D(rubyTexture, xyp_16_17_18.yw).rgb;
    vec3 P18 = texture2D(rubyTexture, xyp_16_17_18.zw).rgb;

    vec3 P21 = texture2D(rubyTexture, xyp_21_22_23.xw).rgb;
    vec3 P22 = texture2D(rubyTexture, xyp_21_22_23.yw).rgb;
    vec3 P23 = texture2D(rubyTexture, xyp_21_22_23.zw).rgb;

    vec3 P5 = texture2D(rubyTexture, xyp_5_10_15.xy).rgb;
    vec3 P10 = texture2D(rubyTexture, xyp_5_10_15.xz).rgb;
    vec3 P15 = texture2D(rubyTexture, xyp_5_10_15.xw).rgb;

    vec3 P9 = texture2D(rubyTexture, xyp_9_14_9.xy).rgb;
    vec3 P14 = texture2D(rubyTexture, xyp_9_14_9.xz).rgb;
    vec3 P19 = texture2D(rubyTexture, xyp_9_14_9.xw).rgb;

    // Store luminance values of each point in groups of 4
    // so that we may operate on all four corners at once
    vec4 p7 = lum_to(P7, P11, P17, P13);
    vec4 p8 = lum_to(P8, P6, P16, P18);
    vec4 p11 = p7.yzwx; // P11, P17, P13, P7
    vec4 p12 = lum_to(P12, P12, P12, P12);
    vec4 p13 = p7.wxyz; // P13, P7,  P11, P17
    vec4 p14 = lum_to(P14, P2, P10, P22);
    vec4 p16 = p8.zwxy; // P16, P18, P8,  P6
    vec4 p17 = p7.zwxy; // P17, P13, P7,  P11
    vec4 p18 = p8.wxyz; // P18, P8,  P6,  P16
    vec4 p19 = lum_to(P19, P3, P5, P21);
    vec4 p22 = p14.wxyz; // P22, P14, P2,  P10
    vec4 p23 = lum_to(P23, P9, P1, P15);

    // Scale current texel coordinate to [0..1]
    vec2 fp = fract(tc * rubyTextureSize);

    // Determine amount of "smoothing" or mixing that could be done on texel corners
    vec4 AiMulFpy = Ai * fp.y;
    vec4 B45MulFpx = B45 * fp.x;
    vec4 ma45 = smoothstep(C45 - M45, C45 + M45, AiMulFpy + B45MulFpx);
    vec4 ma30 = smoothstep(C30 - M30, C30 + M30, AiMulFpy + B30 * fp.x);
    vec4 ma60 = smoothstep(C60 - M60, C60 + M60, AiMulFpy + B60 * fp.x);
    vec4 marn = smoothstep(C45 - M45 + Mshift, C45 + M45 + Mshift,
            AiMulFpy + B45MulFpx);

    // Perform edge weight calculations
    vec4 e45 = lum_wd(p12, p8, p16, p18, p22, p14, p17, p13);
    vec4 econt = lum_wd(p17, p11, p23, p13, p7, p19, p12, p18);
    vec4 e30 = lum_df(p13, p16);
    vec4 e60 = lum_df(p8, p17);

    // Calculate rule results for interpolation
    bvec4 r45_1 = _and_(notEqual(p12, p13), notEqual(p12, p17));
    bvec4 r45_2 = _and_(not (lum_eq(p13, p7)), not (lum_eq(p13, p8)));
    bvec4 r45_3 = _and_(not (lum_eq(p17, p11)), not (lum_eq(p17, p16)));
    bvec4 r45_4_1 = _and_(not (lum_eq(p13, p14)), not (lum_eq(p13, p19)));
    bvec4 r45_4_2 = _and_(not (lum_eq(p17, p22)), not (lum_eq(p17, p23)));
    bvec4 r45_4 = _and_(lum_eq(p12, p18), _or_(r45_4_1, r45_4_2));
    bvec4 r45_5 = _or_(lum_eq(p12, p16), lum_eq(p12, p8));
    bvec4 r45 = _and_(r45_1, _or_(_or_(_or_(r45_2, r45_3), r45_4), r45_5));
    bvec4 r30 = _and_(notEqual(p12, p16), notEqual(p11, p16));
    bvec4 r60 = _and_(notEqual(p12, p8), notEqual(p7, p8));

    // Combine rules with edge weights
    bvec4 edr45 = _and_(lessThan(e45, econt), r45);
    bvec4 edrrn = lessThanEqual(e45, econt);
    bvec4 edr30 = _and_(lessThanEqual(coef * e30, e60), r30);
    bvec4 edr60 = _and_(lessThanEqual(coef * e60, e30), r60);

    // Finalize interpolation rules and cast to float (0.0 for false, 1.0 for true)
    vec4 final45 = vec4(_and_(_and_(not (edr30), not (edr60)), edr45));
    vec4 final30 = vec4(_and_(_and_(edr45, not (edr60)), edr30));
    vec4 final60 = vec4(_and_(_and_(edr45, not (edr30)), edr60));
    vec4 final36 = vec4(_and_(_and_(edr60, edr30), edr45));
    vec4 finalrn = vec4(_and_(not (edr45), edrrn));

    // Determine the color to mix with for each corner
    vec4 px = step(lum_df(p12, p17), lum_df(p12, p13));

    // Determine the mix amounts by combining the final rule result and corresponding
    // mix amount for the rule in each corner
    vec4 mac = final36 * max(ma30, ma60) + final30 * ma30 + final60 * ma60
            + final45 * ma45 + finalrn * marn;

    /*
     Calculate the resulting color by traversing clockwise and counter-clockwise around
     the corners of the texel

     Finally choose the result that has the largest difference from the texel's original
     color
     */
    vec3 res1 = P12;
    res1 = mix(res1, mix(P13, P17, px.x), mac.x);
    res1 = mix(res1, mix(P7, P13, px.y), mac.y);
    res1 = mix(res1, mix(P11, P7, px.z), mac.z);
    res1 = mix(res1, mix(P17, P11, px.w), mac.w);

    vec3 res2 = P12;
    res2 = mix(res2, mix(P17, P11, px.w), mac.w);
    res2 = mix(res2, mix(P11, P7, px.z), mac.z);
    res2 = mix(res2, mix(P7, P13, px.y), mac.y);
    res2 = mix(res2, mix(P13, P17, px.x), mac.x);

    gl_FragColor = vec4(mix(res1, res2, step(c_df(P12, res1), c_df(P12, res2))),
            1.0);
}

The shaders receive a 2D texture and are meant to scale it beautifully across a high-res 2D surface (the device screen). It is an implementation of the SABR scaling algorithm, in case it matters.

It already works, and performs OK on very high-end devices (like LG Nexus 4), but it is really slow on weaker devices.

The devices that really matter to me are the Samsung Galaxy S2 and S3, with the Mali-400MP GPU, which performs horribly with this shader.

So far I've tried:

  1. Eliminating varyings (advice from ARM's Mali guide) - made a minor improvement.
  2. Overriding the mix() functions with my own - did no good.
  3. Reducing float precision to lowp - didn't change anything.

I measure performance by timing the render (taking timestamps before and after eglSwapBuffers) - this gives me a very linear and consistent measurement.
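The measurement scheme above can be sketched in a few lines. This is a hypothetical stand-in, not GL code: `render` and `swap` are placeholders for the actual draw calls and `eglSwapBuffers`, which is typically where the driver blocks while the GPU catches up.

```python
import time

def frame_time_ms(render, swap):
    """Time one frame: stamp before the draw calls, stamp after the
    buffer swap returns. Since the swap is typically where the driver
    blocks on the GPU, the delta captures the GPU-bound frame cost,
    which is why the measurement comes out linear and consistent."""
    t0 = time.perf_counter()
    render()   # stand-in for issuing the draw calls
    swap()     # stand-in for eglSwapBuffers
    return (time.perf_counter() - t0) * 1000.0  # milliseconds
```

Averaging this over many frames (rather than trusting a single frame) smooths out scheduler noise.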

Beyond that, I don't really know where to look or what can be optimized here...

I know that this is a heavy algorithm, and I am not asking for advice on what alternate scaling methods to use - I've tried many and this algorithm gives the best visual result. I wish to use the exact same algorithm in an optimized way.

UPDATE

  1. I found that if I do all the texture fetches with a constant vector instead of dependent vectors I get a major performance improvement, so this is obviously a big bottleneck - probably because of the cache. However, I still need to do those fetches. I played with doing at least some of the fetches with vec2 varyings (without any swizzling) but it didn't improve anything. I wonder what might be a good way to efficiently poll 21 texels.

  2. I found that a major part of the calculation is done multiple times with the exact same set of texels - because the output is scaled by at least x2, and I poll with GL_NEAREST. There are at least 4 fragments that fall on exactly the same texels. If the scaling is x4 on a high-res device, there are 16 fragments that fall on the same texels - which is a big waste. Is there any way to perform an additional shader pass that calculates all the values that don't change across those fragments? I thought about rendering to an additional off-screen texture, but I would need to store multiple values per texel, not just one.
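The waste described in point 2 can be modeled with a toy Python sketch (not GL code): under an x4 nearest-neighbor upscale, 16 output fragments redo identical per-texel work, and a cache — the role an extra off-screen pass would play — cuts that to one evaluation per source texel.

```python
# Toy model: count how often the texel-dependent part of the shader
# runs when every output fragment redoes it, vs. caching one result
# per source texel (what a precompute pass would achieve).
from functools import lru_cache

SCALE = 4            # x4 upscale with GL_NEAREST
W, H = 8, 8          # tiny stand-in for the source texture

evaluations = 0

@lru_cache(maxsize=None)
def per_texel(tx, ty):
    # stand-in for the work that is identical for every fragment
    # covering source texel (tx, ty): lookups, luminance, edge rules
    global evaluations
    evaluations += 1
    return (tx, ty)

for fy in range(H * SCALE):          # iterate output fragments
    for fx in range(W * SCALE):
        per_texel(fx // SCALE, fy // SCALE)

# Without the cache the body would run W*H*SCALE*SCALE = 1024 times;
# with it, once per source texel: 64 times.
```

The catch noted in the question still applies: a real precompute pass would have to pack several values per texel into the off-screen target, not just one.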

UPDATE

  1. Tried to simplify the boolean expressions using known boolean identities - this saved a few operations but had no effect on performance.

UPDATE

  1. Thought about a way to move calculations to the vertex shader - just have a "geometry" that covers my full screen, but with one vertex for each original pixel before scaling. For example, if my original texture is 320x200 and my target screen is 1280x800, there would be 320x200 vertices spread evenly. Then do most of the calculations at those vertices. Problem is - my target devices (S2/S3) don't support vertex texture sampling.

UPDATE

  1. Measuring performance on the LG Nexus 4 vs. the Samsung Galaxy S3 shows that the Nexus 4 runs it more than 10 times faster. How can this be? These are two devices from the same generation, with the same resolution, etc... Could the Mali-400MP be really bad in certain situations? I'm sure there is something very specific that makes it run so slowly compared to the Nexus 4 (but I haven't found what yet).

Solution

In my experience, mobile GPU performance is roughly proportional to the number of texture2D calls. You have 21, which really is a lot. Generally, memory lookups are on the order of hundreds of times slower than calculations, so you can do a lot of calculation and still be bottlenecked on the texture lookups. (This also means optimising the rest of your code will probably have little effect: instead of being busy while it waits for the texture lookups, it will be idle while it waits for the texture lookups.) So you need to reduce the number of texture2D calls you make.

It's difficult to say how to do this since I don't really understand your shader, but some ideas:

  • Separate it into a horizontal pass and then a vertical pass. This only works for some shaders, e.g. blurs, but it can seriously reduce the number of texture lookups. For example, a 5x5 Gaussian blur naively does 25 texture lookups; done horizontally then vertically, it only uses 10.
  • Use linear filtering to 'cheat'. If you sample exactly between 4 pixels, instead of the middle of 1 pixel, with linear filtering enabled, you get the average of all 4 pixels for free. However, I don't know how this interacts with your shader. In the blur example again, using linear filtering to sample two pixels at once on either side of the middle pixel lets you sample 5 pixels with 3 texture2D calls, reducing the 5x5 blur to just 6 samples across the horizontal and vertical passes.
  • Just use a smaller kernel (so you don't take as many samples). This affects quality, so you'd probably want some way to detect device performance and switch to a lower-quality shader when the device appears to be slow.
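The first bullet can be verified with a toy Python model (plain lists, not a shader): a 5x5 binomial kernel is the outer product of a 1D kernel, so two 5-tap passes (10 lookups per pixel) reproduce the 25-tap result exactly, even with clamp-to-edge sampling, because clamping is applied per axis.

```python
# Check that a separable 5x5 blur (two 5-tap passes) matches the
# naive 2D version (25 taps per pixel) on a tiny synthetic image.
k1 = [1, 4, 6, 4, 1]                    # 1D binomial kernel
norm1 = sum(k1)                         # 16
k2 = [[a * b for b in k1] for a in k1]  # 5x5 kernel = outer product
norm2 = norm1 * norm1                   # 256

W, H = 12, 9
img = [[(x * 7 + y * 13) % 256 for x in range(W)] for y in range(H)]

def px(im, x, y):
    # clamp-to-edge sampling, like GL_CLAMP_TO_EDGE
    return im[min(max(y, 0), H - 1)][min(max(x, 0), W - 1)]

# Naive 2D pass: 25 lookups per output pixel
naive = [[sum(k2[j][i] * px(img, x + i - 2, y + j - 2)
              for j in range(5) for i in range(5)) / norm2
          for x in range(W)] for y in range(H)]

# Separable: horizontal pass then vertical pass, 5 + 5 lookups per pixel
horiz = [[sum(k1[i] * px(img, x + i - 2, y) for i in range(5)) / norm1
          for x in range(W)] for y in range(H)]
sep = [[sum(k1[j] * px(horiz, x, y + j - 2) for j in range(5)) / norm1
        for x in range(W)] for y in range(H)]
```

Whether SABR's 21-tap neighborhood factors this cleanly is a separate question — the separation trick only applies where the kernel is (at least approximately) an outer product of two 1D passes.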
