Avoiding Calls to floor()


Question

I am working on a piece of code where I need to deal with UVs (2D texture coordinates) that are not necessarily in the 0 to 1 range. For example, I will sometimes get a UV whose u component is 1.2. To handle this, I am implementing a wrap that produces tiling by doing the following:

u -= floor(u)
v -= floor(v)

This turns 1.2 into 0.2, which is the desired result. It also handles negative values: -0.4 becomes 0.6.
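As a minimal sketch of the wrap described above, the operation can be put in a small helper. The name wrap01 is hypothetical and not part of the original question; the body is just the `t - floor(t)` expression from the post.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (the name is not from the original post).
// Wraps a texture coordinate into the 0 to 1 range so the texture tiles.
inline float wrap01(float t) {
    // NaN and infinities have no meaningful wrapped value.
    assert(std::isfinite(t));
    // floor() rounds toward negative infinity, so negative inputs
    // wrap upward: -0.4 - (-1.0) is about 0.6.
    return t - std::floor(t);
}
```

Calling wrap01(1.2f) gives roughly 0.2 and wrap01(-0.4f) gives roughly 0.6, up to float rounding. One caveat worth keeping in mind: for very small negative inputs the subtraction can round to exactly 1.0, so the result is not strictly guaranteed to be below 1.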

However, these calls to floor are rather slow. I have profiled my application with Intel VTune, and a huge number of cycles are spent on this floor operation alone.

After some background reading on the issue, I came up with the following function, which is a bit faster but still leaves a lot to be desired (I am still incurring type conversion penalties, etc.).

int inline fasterfloor( const float x ) { return x > 0 ? (int) x : (int) x - 1; }

I have seen a few tricks done with inline assembly, but none that seems to work exactly correctly or to give a significant speed improvement.

Does anyone know any tricks for handling this kind of scenario?

Recommended Answer

Old question, but I came across it, and it made me convulse slightly that it hasn't been satisfactorily answered.

TL;DR: Don't use inline assembly, intrinsics, or any of the other given solutions for this! Instead, compile with fast/unsafe math optimizations ("-ffast-math -funsafe-math-optimizations -fno-math-errno" in g++; in GCC, -ffast-math already implies the other two). floor() is so slow because it changes global state if the cast would overflow (FLT_MAX does not fit in a scalar integer type of any size). That also makes it impossible to vectorize unless you disable strict IEEE-754 compatibility, which you should probably not rely on anyway. Compiling with these flags disables the problematic behavior.
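To make the advice concrete, here is a rough sketch of the kind of plain loop the compiler is meant to optimize on its own. The function name wrap_uvs and the file name in the build line are assumptions for illustration, not something from the original thread.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical batch helper: wraps n coordinates in place using the
// same t - floor(t) expression as the question.
//
// One possible build line with GCC (wrap.cpp is a made-up file name):
//   g++ -O3 -ffast-math -c wrap.cpp
//
// Whether this loop is actually vectorized depends on the compiler
// version and the target instruction set, so treat it as something
// to measure rather than a guarantee.
void wrap_uvs(float* uv, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        uv[i] -= std::floor(uv[i]);
    }
}
```

With GCC, adding -fopt-info-vec to the build line should report which loops were vectorized, which is a quick way to check whether the flags had the intended effect on a given target. Keep in mind that -ffast-math relaxes floating-point semantics for the whole translation unit, not only for this loop.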

Some notes:

Inline assembly with scalar registers is not vectorizable, which drastically hurts performance when compiling with optimizations. It also requires any relevant values currently held in vector registers to be spilled to the stack and reloaded into scalar registers, which defeats the purpose of hand-optimization.

Inline assembly using the SSE cvttss2si instruction with the method you've outlined is actually slower on my machine than a simple for loop with compiler optimizations. This is likely because the compiler allocates registers and avoids pipeline stalls better when you let it vectorize whole blocks of code together. For a short piece of code like this, with few internal dependency chains and almost no chance of register spilling, the compiler is very unlikely to do worse than hand-optimized code wrapped in asm().

Inline assembly is unportable, unsupported in Visual Studio 64-bit builds, and insanely hard to read. Intrinsics suffer from these same caveats as well as the ones listed above.

All the other listed ways are simply incorrect, which is arguably worse than being slow, and in each case the performance improvement is so marginal that it doesn't justify the crudeness of the approach. (int)(x+16.0)-16.0 is so bad I won't even touch it, but your method is also wrong because it gives floor(-1) as -2. It's also a very bad idea to put branches in math code that is so performance-critical that the standard library won't do the job for you. So your (incorrect) way should look more like ((int) x) - (x<0.0), maybe with an intermediate so you don't have to perform the FPU move twice. Branches can be mispredicted, which can completely negate any gain in performance; also, once math errno is disabled, the cast to int is the biggest remaining bottleneck of any floor() implementation. If you /really/ don't care about getting correct values for negative integers, it may be a reasonable approximation, but I wouldn't risk it unless you know your use case very well.
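The difference between these cast-based variants is easier to see side by side. The sketch below reproduces the question's fasterfloor and the branch-free expression from this answer; the third function, trunc_floor, is not from the original thread. It is a commonly used correction, added here only for comparison, and it assumes the input fits in an int.

```cpp
#include <cassert>
#include <cmath>

// The question's version: wrong for zero and for negative integers,
// for example fasterfloor(-1.0f) returns -2.
inline int fasterfloor(float x) { return x > 0 ? (int)x : (int)x - 1; }

// Branch-free form from the answer. Still wrong for negative integers.
inline int branchless_floor(float x) {
    int i = (int)x;           // single conversion, truncates toward zero
    return i - (x < 0.0f);    // the comparison yields 0 or 1
}

// NOT from the original thread: compare against the truncated value
// instead of against zero, so exact integers are left alone.
inline int trunc_floor(float x) {
    // Converting an out-of-range float to int is undefined behavior.
    assert(x >= -2147483648.0f && x < 2147483648.0f);
    int i = (int)x;
    return i - (x < (float)i);  // subtract 1 only if truncation rounded up
}
```

For inputs such as -0.4f or -2.5f all three agree with floor(); they diverge only on exact negative integers (and, for fasterfloor, on zero). Whether any of them beats std::floor compiled with the flags above is something to profile rather than assume.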

I tried using bitwise casting and rounding-via-bitmask, like what SUN's newlib implementation does in fmodf, but it took a very long time to get right and was several times slower on my machine, even without the relevant compiler optimization flags. Very likely, that code was written for some ancient CPU where floating-point operations were comparatively very expensive and there were no vector extensions, let alone vector conversion operations; AFAIK this is no longer the case on any common architecture. SUN is also the birthplace of the fast inverse sqrt() routine used by Quake 3; there is now an instruction for that on most architectures. One of the biggest pitfalls of micro-optimizations is that they become outdated quickly.
