HLSL branch avoidance

Question

I have a shader where I want to move half of the vertices in the vertex shader. I'm trying to decide the best way to do this from a performance standpoint, because we're dealing with well over 100,000 verts, so speed is critical. I've looked at 3 different methods: (pseudo-code, but enough to give you the idea. The <complex formula> I can't give out, but I can say that it involves a sin() function, as well as a function call (just returns a number, but still a function call), as well as a bunch of basic arithmetic on floating point numbers).

if (y < 0.5)
{
    x += <complex formula>;
}

This has the advantage that the <complex formula> is only executed half the time, but the downside is that it definitely causes a branch, which may actually be slower than the formula. It is the most readable, but we care more about speed than readability in this context.

x += step(y, 0.5) * <complex formula>;

Using HLSL's step() function (which returns 0 if the first param is greater and 1 if less), you can eliminate the branch, but now the <complex formula> is being called every time, and its results are being multiplied by 0 (thus wasted effort) half of the time.

x += (y < 0.5) ? <complex formula> : 0;

This I don't know about. Does the ?: cause a branch? And if not, are both sides of the equation evaluated or only the one that is relevant?

The final possibility is that the <complex formula> could be offloaded back to the CPU instead of the GPU, but I worry that it will be slower in calculating sin() and other operations, which might result in a net loss. Also, it means one more number has to be passed to the shader, and that could cause overhead as well. Anyone have any insight as to which would be the best course of action?

Addendum:

According to http://msdn.microsoft.com/en-us/library/windows/desktop/bb509665%28v=vs.85%29.aspx the step() function uses a ?: internally, so it's probably no better than my 3rd solution, and potentially worse, since <complex formula> is definitely called every time, whereas it may only be called half the time with a straight ?:. (Nobody's answered that part of the question yet.) Though avoiding both and using:

x += (1.0 - y) * <complex formula>;

may well be better than any of them, since there's no comparison being made anywhere. (And y is always either 0 or 1.) Still executes the <complex formula> needlessly half the time, but might be worth it to avoid branches altogether.
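
For reference, the four variants discussed in the question could be written as compilable HLSL roughly as follows. This is only a sketch: `ComplexFormula` and its parameter `t` are hypothetical stand-ins for the undisclosed <complex formula>, and the function names are made up for illustration.

```hlsl
// Hypothetical stand-in for the undisclosed <complex formula>.
float ComplexFormula(float t)
{
    return sin(t) * 0.5 + t * t;
}

// Variant 1: plain if statement.
float MoveIf(float x, float y, float t)
{
    if (y < 0.5)
    {
        x += ComplexFormula(t);
    }
    return x;
}

// Variant 2: step() mask. step(y, 0.5) is 1 when 0.5 >= y, otherwise 0.
float MoveStep(float x, float y, float t)
{
    return x + step(y, 0.5) * ComplexFormula(t);
}

// Variant 3: ternary operator.
float MoveTernary(float x, float y, float t)
{
    return x + ((y < 0.5) ? ComplexFormula(t) : 0.0);
}

// Variant 4: arithmetic mask. Only valid because y is always exactly 0 or 1.
float MoveMask(float x, float y, float t)
{
    return x + (1.0 - y) * ComplexFormula(t);
}
```

Two small differences are worth keeping in mind when comparing them: `step(y, 0.5)` also returns 1 when y is exactly 0.5, unlike `y < 0.5`, and the two multiply-by-mask variants turn a NaN or infinity produced by the formula into a NaN result even when the mask is 0.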

Recommended answer

Perhaps take a look at this answer.

My guess (this is a performance question: measure it!) is that you are best off keeping the if statement.

Reason number one: The shader compiler, in theory (and if invoked correctly), should be clever enough to make the best choice between a branch instruction, and something similar to the step function, when it compiles your if statement. The only way to improve on it is to profile[1]. Note that it's probably hardware-dependent at this level of granularity.

[1] Or if you have specific knowledge about how your data is laid out, read on...
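
If profiling suggests the compiler's default choice is not the best one, HLSL lets you attach the `[branch]` or `[flatten]` attribute to an if statement to state a preference. The sketch below shows both; `ComplexFormula` and its parameter `t` are hypothetical stand-ins for the undisclosed <complex formula>, and whether either attribute actually helps is something to confirm in the compiled output and by measuring on the target hardware.

```hlsl
// Hypothetical stand-in for the undisclosed <complex formula>.
float ComplexFormula(float t)
{
    return sin(t) * 0.5 + t * t;
}

// Ask for a real dynamic branch: only the taken side is evaluated.
float MoveWithBranch(float x, float y, float t)
{
    [branch]
    if (y < 0.5)
    {
        x += ComplexFormula(t);
    }
    return x;
}

// Ask for flattening: both sides are evaluated and one result is selected.
float MoveFlattened(float x, float y, float t)
{
    [flatten]
    if (y < 0.5)
    {
        x += ComplexFormula(t);
    }
    return x;
}
```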

Reason number two is the way shader units work: If even one fragment or vertex in the unit takes a different branch to the others, then the shader unit must take both branches. But if they all take the same branch - the other branch is ignored. So while it is per-unit, rather than per-vertex - it is still possible for the expensive branch to be skipped.

For fragments, the shader units have on-screen locality - meaning you get best performance with groups of nearby pixels all taking the same branch (see the illustration in my linked answer). To be honest, I don't know how vertices are grouped into units - but if your data is grouped appropriately - you should get the desired performance benefit.

Finally: it's worth pointing out that if your <complex formula> is something you could hoist out of your HLSL manually, it may well get hoisted into a CPU-based pre-shader anyway (on PC at least; from memory, the Xbox 360 doesn't support this, and I have no idea about PS3). You can check this by decompiling the shader. If it is something that you only need to calculate once per draw (rather than per vertex or fragment), it is probably best for performance to do it on the CPU.
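
As a rough illustration of the per-draw case, the vertex shader below reads the formula's result from a uniform that the CPU sets once per draw call. It is a sketch assuming a Direct3D 9 style target such as vs_3_0; the names `WorldViewProj`, `FormulaResult`, `VSInput`, `VSOutput`, and the use of `TEXCOORD0` to carry the 0-or-1 y value are all hypothetical, not taken from the original shader.

```hlsl
// Uniforms set by the application once per draw (hypothetical names).
float4x4 WorldViewProj;
float    FormulaResult;   // <complex formula> evaluated on the CPU

struct VSInput
{
    float3 position : POSITION0;
    float  y        : TEXCOORD0;   // assumed to be exactly 0 or 1
};

struct VSOutput
{
    float4 position : POSITION0;
};

VSOutput main(VSInput input)
{
    float3 p = input.position;

    // Move only the vertices with y == 0; no branch and no per-vertex sin().
    p.x += (1.0 - input.y) * FormulaResult;

    VSOutput output;
    output.position = mul(float4(p, 1.0), WorldViewProj);
    return output;
}
```

This only applies if the formula does not depend on per-vertex data; otherwise it has to stay in the shader.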
