Merit of inline-ASM rounding via putting float into int variable


Question



I have inherited a pretty interesting piece of code:

inline int round(float a)
{
  int i;
  __asm {
    fld   a
    fistp i
  }
  return i;
}

My first impulse was to discard it and replace calls with (int)std::round (pre-C++11, would use std::lround if it happened today), but after a while I started to wonder if it might have some merit after all...


The use cases for this function are all values in [-100, 100], so even int8_t would be wide enough to hold the result. fistp requires at least a 32 bit memory variable, however, so anything narrower than int32_t is just as wasteful as anything wider.

Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards. C++11 offers the std::lround function, which alleviates this particular issue, but still does seem to be more wasteful, considering that the value passes float->long->int instead of directly arriving where it should.

On the other hand, with inline-ASM in the function, the compiler cannot optimise away i into a register (and even if it could, fistp expects a memory variable), so std::lround does not seem too much worse...

The most pressing question I have is however how safe it is to assume (as this function does), that the rounding mode will always be round-to-nearest, as it obviously does (no checks). As std::lround has to guarantee a certain behaviour independent of rounding mode, this assumption, as long as it holds, always seems to make the inline-ASM round the better option.

It is furthermore highly unclear to me whether the rounding mode set by std::fesetround and used by the std::lround alternative std::lrint and the rounding mode employed in the fistp ASM-instruction are guaranteed to be the same or at least synchronous.


These are my considerations, aka what I do not know to make an informed decision on retaining or replacing the function.

Now to the questions:


Following a more informed view of these considerations or such which I have not thought of, does it seem advisable to use this function?

How great is the risk, if any?

Does reasoning exist for why it would not be faster than std::lround or std::lrint?

Can it be further improved without performance cost?

Does any of this reasoning change if the program were compiled for x86-64?

Solution

TL;DR: use lrintf(x) or (int)nearbyintf(x), depending on which one your compiler likes better.

Check the asm to see which one inlines when SSE4.1 is available (e.g. -march=nehalem or penryn, or later), with or without -ffast-math. You may need -fno-math-errno to get GCC to inline sometimes, but clang inlines anyway. This is 100% safe unless you actually expect lrintf or sqrtf or other math functions to set errno, and is generally recommended along with -fno-trapping-math.


Don't use inline asm when you can possibly avoid it. Compilers don't "understand" what it does, so they can't optimize through it. e.g. If that function is inlined somewhere that makes its argument a compile-time constant, it will still fld a constant and fistp it to memory, then load that back into an integer register. Pure C will let the compiler propagate the constant and just mov r32, imm32, or further propagate the constant and fold it into something else. Not to mention CSE, and hoisting the conversion out of a loop. (MSVC inline asm doesn't let you specify that an asm block is a pure function, and only needs to be run if the output value is needed, and that it doesn't depend on a global. GNU C inline asm does allow that part, but it's still a bad choice for this because it's not transparent to the compiler).

The GCC wiki even has a page on this subject, explaining the same things as my previous paragraph (and more), so inline asm should definitely be a last resort.

In this case, we can get the compiler to emit good code from pure C, so we should absolutely do that.

Float->int with the current rounding mode only takes a single machine instruction (see below), but the trick is to get a compiler to emit it (and only it). Getting math-library functions to inline can be tricky, because some of them have to set errno and/or raise an inexact exception in certain cases. (-fno-math-errno can help, if you can't use the full -ffast-math or the MSVC equivalent)

With some compilers (gcc but not clang), lrintf is good. It isn't ideal, though: float->long->int isn't the same as directly to int when they're not the same size. The x86-64 SystemV ABI (used by everything except Windows) has 64bit long.

64bit long changes the overflow semantics for lrint: instead of getting 0x80000000 (on x86 with SSE instructions), you'll get the low 32bits of the long (which will be all-zero if the value was outside the range of a long).

This lrintf won't auto-vectorize (unless maybe the compiler can prove that the floats will be in-range), because there are only scalar, not SIMD, instructions to convert floats or double to packed 64bit integers (until AVX512DQ). IDK of a C math library function to convert directly to int, but you can use (int)nearbyintf(x), which does auto-vectorize more easily in 64bit code. See the section below for how well gcc and clang do with that.

Other than defeating auto-vectorization, though, there's no direct speed penalty for cvtss2si rax, xmm0 on any modern microarchitecture (see Agner Fog's insn tables). It just costs an extra instruction byte for the REX prefix.

On AArch64 (aka ARM64), gcc4.8 compiles lround into a single fcvtas x0, s0 instruction, so I guess ARM64 provides that funky rounding mode in hardware (but x86 doesn't). Strangely, -ffast-math makes fewer functions inline, but that's with clunky old gcc4.8. For ARM (not 64), gcc4.8 doesn't inline anything, even with -mfloat-abi=hard -mhard-float -march=armv7-a. Maybe those aren't the right options; IDK ARM very well :/

If you have a lot of conversions to do, you can manually vectorize for x86 with SSE / AVX intrinsics, like _mm_cvtps_epi32 (cvtps2dq), and even pack the resulting 32bit integer elements down to 16 or 8 bit (with packssdw). However, using pure C that the compiler can auto-vectorize is a good plan, because it's portable.
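A minimal sketch of the manual route, assuming an x86 target with SSE2 (the function name is hypothetical). It rounds 16 floats with the current rounding mode and narrows them to int8_t with saturating packs; packsswb (_mm_packs_epi16) is used for the final 16-to-8-bit step.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Convert 16 floats to 16 int8_t values, rounding with the current
   MXCSR rounding mode and saturating on the way down. */
void round16_to_int8(const float *src, int8_t *dst) {
    __m128i a = _mm_cvtps_epi32(_mm_loadu_ps(src));       /* cvtps2dq */
    __m128i b = _mm_cvtps_epi32(_mm_loadu_ps(src + 4));
    __m128i c = _mm_cvtps_epi32(_mm_loadu_ps(src + 8));
    __m128i d = _mm_cvtps_epi32(_mm_loadu_ps(src + 12));

    __m128i ab = _mm_packs_epi32(a, b);     /* packssdw: 32 -> 16 bit */
    __m128i cd = _mm_packs_epi32(c, d);
    __m128i all = _mm_packs_epi16(ab, cd);  /* packsswb: 16 -> 8 bit */

    _mm_storeu_si128((__m128i *)dst, all);
}
```

The packs keep the elements in their original order here, so dst[i] corresponds to src[i].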


lrintf

#include <math.h>
int round_to_nearest(float f) {  // default mode is always nearest
  return lrintf(f);
}

Compiler output from the Godbolt Compiler explorer:

       ########### Without -ffast-math #############
    cvtss2si        eax, xmm0    # gcc 6.1  (-O3 -mx32, so long is 32bit)

    cvtss2si        rax, xmm0    # gcc 4.4 through 6.1  (-O3).  can't auto-vectorize, though.

    jmp     lrintf               # clang 3.8 (-O3 -msse4.1), still tail-calls the function :/

             ###### With -ffast-math #########
    jmp     lrintf               # clang 3.8 (-O3 -msse4.1 -ffast-math)

So clearly clang doesn't do well with it, but even ancient gcc is great, and does a good job even without -ffast-math.


Don't use roundf/lroundf: it has non-standard rounding semantics (halfway cases away from 0, instead of to even). This leads to worse x86 asm, but actually better ARM64 asm. So maybe do use it for ARM? It does have fixed rounding behaviour, though, instead of using the current rounding mode.

If you want the return value as a float, instead of converting to int, it may be better to use nearbyintf. rint has to raise the FP inexact exception when output != input. (But SSE4.1 roundss can implement either behaviour with bit 3 of its immediate control byte).


truncating nearbyint() to int directly.

#include <math.h>
int round_to_nearest(float f) {
  return nearbyintf(f);
}

Compiler output from the Godbolt Compiler explorer.

        ########  With -ffast-math ############
    cvtss2si        eax, xmm0      # gcc 4.8 through 6.1 (-O3 -ffast-math)

    # clang is dumb and won't fold the roundss into the cvt.  Without sse4.1, it's a function call
    roundss xmm0, xmm0, 12         # clang 3.5 to 3.8 (-O3 -ffast-math -msse4.1)
    cvttss2si       eax, xmm0

    roundss   xmm1, xmm0, 12      # ICC13 (-O3 -msse4.1 -ffast-math)
    cvtss2si  eax, xmm1

        ######## WITHOUT -ffast-math ############
    sub     rsp, 8
    call    nearbyintf                    # gcc 6.1 (-O3 -msse4.1)
    add     rsp, 8                        # and clang without -msse4.1
    cvttss2si       eax, xmm0

    roundss xmm0, xmm0, 12               # clang3.2 and later (-O3 -msse4.1)
    cvttss2si       eax, xmm0

    roundss   xmm1, xmm0, 12             # ICC13 (-O3 -msse4.1)
    cvtss2si  eax, xmm1

Gcc 4.7 and earlier: Just cvttss2si without -msse4.1, but emits a roundss if SSE4.1 is available. Its nearbyint definition must be using inline-asm, because the asm syntax is broken in intel-syntax output. Probably this is how it gets inserted and then not optimized away when it realizes it's converting to int.


How it works in asm

Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards.

That's only true if you're targeting 20-year-old CPUs without SSE. (You said float, not double, so we only need SSE, not SSE2. The most recent CPUs without SSE2 are Athlon XP).

Modern systems do floating point in xmm registers. SSE has instructions to convert a scalar float to signed int with truncation (cvttss2si) or with the current rounding mode (cvtss2si). (Note the extra t for Truncate in the first one. The rest of the mnemonic is Convert Scalar Single-precision To Signed Integer.) There are similar instructions for double, and x86-64 allows the destination to be a 64bit integer register.

See also the tag wiki.

cvttss2si basically exists because of C's default behaviour for casting float to int. Changing the rounding mode is slow, so Intel provided a way to do it that doesn't suck.

I think even 32bit versions of modern Windows require hardware new enough to have SSE2, in case that matters to anyone. (SSE2 is part of the AMD64 ISA, and the 64bit calling conventions even pass float / double args in xmm registers).
