Merit of inline-ASM rounding via putting float into int variable

Question

I have inherited a pretty interesting piece of code:

inline int round(float a)
{
  int i;
  __asm {
    fld   a
    fistp i
  }
  return i;
}

My first impulse was to discard it and replace the calls with (int)std::round (this was pre-C++11; I would use std::lround if it happened today), but after a while I started to wonder whether it might have some merit after all...

The inputs to this function are all values in [-100, 100], so even int8_t would be wide enough to hold the result. fistp requires at least a 32 bit memory variable, however, so anything smaller than int32_t is just as wasted as anything larger.

Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards. C++11 offers the std::lround function, which alleviates this particular issue, but it still seems more wasteful, considering that the value passes float->long->int instead of directly arriving where it should.

On the other hand, with inline-ASM in the function, the compiler cannot optimise i away into a register (and even if it could, fistp expects a memory variable), so std::lround does not seem too much worse...

The most pressing question I have, however, is how safe it is to assume (as this function does, with no checks) that the rounding mode will always be round-to-nearest. As std::lround has to guarantee a certain behaviour independent of the rounding mode, this assumption, as long as it holds, always seems to make the inline-ASM round the better option.

It is furthermore highly unclear to me whether the rounding mode set by std::fesetround (and used by std::lrint, the alternative to std::lround) and the rounding mode employed by the fistp ASM instruction are guaranteed to be the same, or at least in sync.

These are my considerations, i.e. what I do not know but would need to know to make an informed decision on retaining or replacing the function.

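Now to the questions:

With a more informed view on these considerations, or on others I did not think of, does it seem advisable to use this function?

How great is the risk, if any?

Does reasoning exist for why it would not be faster than std::lround or std::lrint?

Can it be improved further without sacrificing performance?

Does any of this reasoning change if the program is compiled for x86-64?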

Recommended answer

TL;DR: use lrintf(x) or (int)nearbyintf(x), depending on which one your compiler likes better.

Check the asm to see which one inlines when SSE4.1 is available (e.g. -march=nehalem or penryn, or later), with or without -ffast-math. You may need -fno-math-errno to get GCC to inline sometimes, but clang inlines anyway. This is 100% safe unless you actually expect lrintf or sqrtf or other math functions to set errno, and is generally recommended along with -fno-trapping-math.

Don't use inline asm when you can possibly avoid it. Compilers don't "understand" what it does, so they can't optimize through it. E.g. if that function is inlined somewhere that makes its argument a compile-time constant, it will still fld a constant and fistp it to memory, then load that back into an integer register. Pure C will let the compiler propagate the constant and just mov r32, imm32, or further propagate the constant and fold it into something else. Not to mention CSE, and hoisting the conversion out of a loop. (MSVC inline asm doesn't let you specify that an asm block is a pure function that only needs to run if the output value is needed and that doesn't depend on a global. GNU C inline asm does allow that part, but it's still a bad choice for this because it's not transparent to the compiler.)

The GCC wiki even has a page on this subject, explaining the same things as my previous paragraph (and more), so inline asm should definitely be a last resort.

In this case, we can get the compiler to emit good code from pure C, so we should definitely do that.

Float->int with the current rounding mode only takes a single machine instruction (see below), but the trick is to get a compiler to emit it (and only it). Getting math-library functions to inline can be tricky, because some of them have to set errno and/or raise an inexact exception in certain cases. (-fno-math-errno can help, if you can't use the full -ffast-math or the MSVC equivalent.)

With some compilers (gcc but not clang), lrintf is good. It isn't ideal, though: float->long->int isn't the same as converting directly to int when they're not the same size. The x86-64 System V ABI (used by everything except Windows) has 64bit long.

64bit long changes the overflow semantics for lrint: instead of getting 0x80000000 (on x86 with SSE instructions), you'll get the low 32bits of the long (which will be all-zero if the value was outside the range of a long).

This lrintf won't auto-vectorize (unless maybe the compiler can prove that the floats will be in-range), because there are only scalar, not SIMD, instructions to convert float or double to packed 64bit integers (until AVX512DQ). IDK of a C math library function to convert directly to int, but you can use (int)nearbyintf(x), which does auto-vectorize more easily in 64bit code. See the section below for how well gcc and clang do with that.

Other than defeating auto-vectorization, though, there's no direct speed penalty for cvtss2si rax, xmm0 on any modern microarchitecture (see Agner Fog's insn tables). It just costs an extra instruction byte for the REX prefix.

On AArch64 (aka ARM64), gcc4.8 compiles lround into a single fcvtas x0, s0 instruction, so I guess ARM64 provides that funky rounding mode in hardware (but x86 doesn't). Strangely, -ffast-math makes fewer functions inline, but that's with clunky old gcc4.8. For ARM (not 64), gcc4.8 doesn't inline anything, even with -mfloat-abi=hard -mhard-float -march=armv7-a. Maybe those aren't the right options; I don't know ARM very well :/

If you have a lot of conversions to do, you can manually vectorize for x86 with SSE / AVX intrinsics, like _mm_cvtps_epi32 (cvtps2dq), and even pack the resulting 32bit integer elements down to 16 or 8 bit (with packssdw). However, using pure C that the compiler can auto-vectorize is a good plan, because it's portable.
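As a rough illustration of the manual-vectorization route, here is a minimal sketch that rounds eight floats at once and narrows them to 16 bit. The function name round8_to_int16 is made up for this example, and the scalar fallback is an assumption added so the code still builds on targets where the compiler doesn't define __SSE2__ (non-x86 targets, or MSVC, which doesn't define that macro).

```c
#include <math.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: round 8 floats using the current rounding mode
   and store them as 16-bit integers. */
void round8_to_int16(const float *in, int16_t *out)
{
#if defined(__SSE2__)
    __m128i lo = _mm_cvtps_epi32(_mm_loadu_ps(in));      /* cvtps2dq */
    __m128i hi = _mm_cvtps_epi32(_mm_loadu_ps(in + 4));  /* cvtps2dq */
    /* packssdw: narrow 2 x 4 int32 to 8 int16 with signed saturation */
    _mm_storeu_si128((__m128i *)out, _mm_packs_epi32(lo, hi));
#else
    /* Portable fallback; assumes every value fits in int16_t. */
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)lrintf(in[i]);
#endif
}
```

For inputs in [-100, 100], as in the question, the saturation in packssdw never kicks in, so both paths should give the same results under the default rounding mode.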

#include <math.h>
int round_to_nearest(float f) {  // default mode is always nearest
  return lrintf(f);
}

Compiler output from the Godbolt Compiler Explorer:

       ########### Without -ffast-math #############
    cvtss2si        eax, xmm0    # gcc 6.1  (-O3 -mx32, so long is 32bit)

    cvtss2si        rax, xmm0    # gcc 4.4 through 6.1  (-O3).  can't auto-vectorize, though.

    jmp     lrintf               # clang 3.8 (-O3 -msse4.1), still tail-calls the function :/

             ###### With -ffast-math #########
    jmp     lrintf               # clang 3.8 (-O3 -msse4.1 -ffast-math)

So clearly clang doesn't do well with it, but even ancient gcc is great, and does a good job even without -ffast-math.

Don't use roundf/lroundf: they have non-standard rounding semantics (halfway cases away from 0, instead of to even). This leads to worse x86 asm, but actually better ARM64 asm. So maybe do use them for ARM? They do have fixed rounding behaviour, though, instead of using the current rounding mode.

If you want the return value as a float, instead of converting to int, it may be better to use nearbyintf. rint has to raise the FP inexact exception when output != input. (But SSE4.1 roundss can implement either behaviour with bit 3 of its immediate control byte.)

#include <math.h>
int round_to_nearest(float f) {
  return nearbyintf(f);
}

Compiler output from the Godbolt Compiler Explorer:

        ########  With -ffast-math ############
    cvtss2si        eax, xmm0      # gcc 4.8 through 6.1 (-O3 -ffast-math)

    # clang is dumb and won't fold the roundss into the cvt.  Without sse4.1, it's a function call
    roundss xmm0, xmm0, 12         # clang 3.5 to 3.8 (-O3 -ffast-math -msse4.1)
    cvttss2si       eax, xmm0

    roundss   xmm1, xmm0, 12      # ICC13 (-O3 -msse4.1 -ffast-math)
    cvtss2si  eax, xmm1

        ######## WITHOUT -ffast-math ############
    sub     rsp, 8
    call    nearbyintf                    # gcc 6.1 (-O3 -msse4.1)
    add     rsp, 8                        # and clang without -msse4.1
    cvttss2si       eax, xmm0

    roundss xmm0, xmm0, 12               # clang3.2 and later (-O3 -msse4.1)
    cvttss2si       eax, xmm0

    roundss   xmm1, xmm0, 12             # ICC13 (-O3 -msse4.1)
    cvtss2si  eax, xmm1

Gcc 4.7 and earlier: Just cvttss2si without -msse4.1, but emits a roundss if SSE4.1 is available. Its nearbyint definition must be using inline-asm, because the asm syntax is broken in intel-syntax output. Probably this is how it gets inserted and then not optimized away when it realizes it's converting to int.

"Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards."

That's only true if you're targeting 20-year-old CPUs without SSE. (You said float, not double, so we only need SSE, not SSE2. The last mainstream CPUs without SSE2 were the Athlon XP generation.)

Modern systems do floating point in xmm registers. SSE has instructions to convert a scalar float to signed int with truncation (cvttss2si) or with the current rounding mode (cvtss2si). (Note the extra t for Truncate in the first one. The rest of the mnemonic is Convert Scalar Single-precision To Signed Integer.) There are similar instructions for double, and x86-64 allows the destination to be a 64bit integer register.

See also the x86 tag wiki.

cvttss2si basically exists because of C's default behaviour for casting float to int. Changing the rounding mode is slow, so Intel provided a way to do it that doesn't suck.

I think even 32bit versions of modern Windows require hardware new enough to have SSE2, in case that matters to anyone. (SSE2 is part of the AMD64 ISA, and the 64bit calling conventions even pass float / double args in xmm registers.)
