Efficiency of branching in shaders


Problem description

I understand that this question may seem somewhat ungrounded, but if anyone has theoretical knowledge or practical experience on this topic, it would be great if you could share it.

I am attempting to optimize one of my old shaders, which uses a lot of texture lookups.

I've got diffuse, normal and specular maps for each of three possible mapping planes, and for some faces that are near the user I also have to apply mapping techniques that bring in a lot of extra texture lookups (like parallax occlusion mapping).

Profiling showed that texture lookups are the bottleneck of the shader, and I am willing to remove some of them. For some input parameters I already know that part of the texture lookups would be unnecessary, and the obvious solution is to do something like (pseudocode):

if (part_actually_needed) {
   perform lookups;
   perform other steps specific for THIS PART;
}

// All other parts.

Now, here comes the question.

I do not remember exactly (that's why I said the question might be ungrounded), but in some paper I recently read (unfortunately, I can't remember the name) something similar to the following was stated:


The performance of the presented technique depends on how efficiently the HARDWARE-BASED CONDITIONAL BRANCHING is implemented.

I remembered this statement right before I was about to start refactoring a large number of shaders to implement the if-based optimization I was talking about.

So, right before I start doing that: does anyone know something about the efficiency of branching in shaders? Why could branching cause a severe performance penalty in shaders?

And is it even possible that I could only worsen the actual performance with the if-based branching?

You might say: try it and see. Yes, that's what I'm going to do if nobody here helps me :)

But still, what is effective in the if case on new GPUs could be a nightmare on slightly older ones. And that kind of issue is very hard to forecast unless you have a lot of different GPUs (which is not my case).

So, if anyone knows something about this or has benchmarking experience, I would really appreciate your help.

The few remaining brain cells that are actually working keep telling me that branching on GPUs might be far less efficient than branching on the CPU (which usually has extremely efficient branch prediction and ways of avoiding cache misses), simply because it's a GPU (or because that could be hard or impossible to implement on a GPU).

Unfortunately, I am not sure whether this statement has anything in common with the real situation...

Recommended answer

Unfortunately, I think the real answer here is to do practical testing of your specific case with a performance analyser, on your target hardware. That is particularly true given that it sounds like you're at the project optimisation stage; it is the only way to take into account both the fact that hardware changes frequently and the nature of the specific shader.

On a CPU, a mispredicted branch causes a pipeline flush, and since CPU pipelines are so deep, you effectively lose something on the order of 20 or more cycles. On the GPU things are a little different: the pipelines are likely to be far shallower, but there's no branch prediction, and all of the shader code will be in fast memory. But that's not the real difference.

It's difficult to know the exact details of everything that's going on, because nVidia and ATI are relatively tight-lipped, but the key thing is that GPUs are made for massively parallel execution. There are many asynchronous shader cores, but each core is in turn designed to run multiple threads. My understanding is that each core expects to run the same instruction on all its threads on any given cycle (nVidia calls this collection of threads a "warp").

In this case, a thread might represent a vertex, a geometry element or a pixel/fragment, and a warp is a collection of about 32 of those. For pixels, they're likely to be pixels that are close to each other on screen. The problem is that if, within one warp, different threads make different decisions at the conditional jump, the warp has diverged and is no longer running the same instruction for every thread. The hardware can handle this, but it's not entirely clear (to me, at least) how it does so. It's also likely to be handled slightly differently by each successive generation of cards. The newest, most general CUDA/compute-shader friendly nVidia cards might have the best implementation; older cards might have a poorer one. In the worst case, you may find many threads executing both sides of if/else statements.
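To make the divergence cost concrete, here is a minimal toy model in Python. It is only a sketch of the lockstep behaviour described above, not a description of how any particular GPU works: the function name `warp_cost`, the cycle costs and the fixed warp width of 32 are all assumptions chosen for illustration.

```python
WARP_SIZE = 32  # assumed warp width, matching the "about 32" figure above


def warp_cost(conditions, cost_if, cost_else=0):
    """Cycles one lockstep warp spends on a single if/else in this toy model.

    conditions: one boolean per thread (True means the thread takes the 'if' side).
    A side costs its full price for the whole warp as soon as at least one
    thread needs it; threads on the other side are just masked off and wait.
    """
    cost = 0
    if any(conditions):        # at least one thread takes the 'if' side
        cost += cost_if
    if not all(conditions):    # at least one thread takes the 'else' side
        cost += cost_else
    return cost


uniform_skip = [False] * WARP_SIZE               # whole warp skips the lookups
uniform_take = [True] * WARP_SIZE                # whole warp needs the lookups
diverged = [True] + [False] * (WARP_SIZE - 1)    # a single thread disagrees

print(warp_cost(uniform_skip, cost_if=100, cost_else=10))  # 10
print(warp_cost(uniform_take, cost_if=100, cost_else=10))  # 100
print(warp_cost(diverged, cost_if=100, cost_else=10))      # 110
```

In this model the branch only pays off when every thread in a warp agrees, which is why a condition that is coherent across neighbouring pixels matters more than how often the condition is true overall.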

One of the great tricks with shaders is learning how to leverage this massively parallel paradigm. Sometimes that means using extra passes, temporary offscreen buffers and stencil buffers to push logic up out of the shaders and onto the CPU. Sometimes an optimisation may appear to burn more cycles, but it could actually be reducing some hidden overhead.

Also note that you can explicitly mark if statements in DirectX shaders as [branch] or [flatten]. The flatten style gives you the right result, but always executes all of the instructions, including the ones inside the if. If you don't explicitly choose one, the compiler can choose one for you, and it may pick [flatten], which is no good for your example.
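The difference between the two styles can be sketched in plain Python. This is not HLSL and does not model warps; `CountingSampler`, `shade_flatten` and `shade_branch` are hypothetical names, and the sampler is a stand-in for an expensive texture lookup. It only illustrates that both styles give the same result while doing different amounts of work per pixel.

```python
class CountingSampler:
    """Stand-in for an expensive texture lookup that counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, uv):
        self.calls += 1
        return 0.5  # dummy texel value


def shade_flatten(needed, uv, sample):
    # Flatten style: the lookup always runs, then the result is selected.
    detail = sample(uv)
    return detail if needed else 0.0


def shade_branch(needed, uv, sample):
    # Branch style: the lookup is skipped when it is not needed.
    if needed:
        return sample(uv)
    return 0.0


needed_per_pixel = [True, False, False, False] * 2  # 2 of 8 pixels need the lookup
flat_sampler, branch_sampler = CountingSampler(), CountingSampler()

flat = [shade_flatten(n, (0.0, 0.0), flat_sampler) for n in needed_per_pixel]
branched = [shade_branch(n, (0.0, 0.0), branch_sampler) for n in needed_per_pixel]

print(flat == branched)                          # True
print(flat_sampler.calls, branch_sampler.calls)  # 8 2
```

On real hardware the saving from the branch style is smaller than this per-pixel count suggests, because a warp that diverges still pays for the lookups as a group.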

One thing to remember is that if you jump over the first texture lookup, this will confuse the hardware's texture coordinate derivative math. You'll get compiler errors, and it's best not to do so; otherwise you might miss out on some of the better texturing support.
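A rough way to see why skipped lookups upset the derivative math: GPUs commonly estimate texture coordinate derivatives by differencing values between neighbouring pixels in a 2x2 block. The Python sketch below is a toy model under that assumption; `ddx_row` and `quad_ddx` are hypothetical helpers, and `None` marks a pixel that branched around the computation.

```python
def ddx_row(left, right):
    """Horizontal derivative for one row of a 2x2 pixel block: right minus left."""
    if left is None or right is None:
        return None  # a neighbour skipped the code, so the difference is undefined
    return right - left


def quad_ddx(quad_u):
    """quad_u is [[top_left, top_right], [bottom_left, bottom_right]]."""
    return [ddx_row(row[0], row[1]) for row in quad_u]


all_active = [[0.25, 0.5], [0.25, 0.5]]    # every pixel computed its u coordinate
one_skipped = [[0.25, None], [0.25, 0.5]]  # top-right pixel jumped over the lookup

print(quad_ddx(all_active))   # [0.25, 0.25]
print(quad_ddx(one_skipped))  # [None, 0.25]
```

The derivative feeds mip level selection and filtering, so an undefined value there is what the compiler is guarding against when it rejects such a lookup inside a branch.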
