Merit of inline-ASM rounding via putting float into int variable
I have inherited a pretty interesting piece of code:
inline int round(float a)
{
int i;
__asm {
fld a
fistp i
}
return i;
}
My first impulse was to discard it and replace calls with (int)std::round (pre-C++11; I would use std::lround if it happened today), but after a while I started to wonder if it might have some merit after all...
The use case for this function is values in [-100, 100], so even int8_t would be wide enough to hold the result. fistp requires at least a 32 bit memory variable, however, so anything narrower than int32_t is just as wasted as anything wider.
Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards. C++11 offers the std::lround function, which alleviates this particular issue, but it still seems more wasteful, considering that the value passes float->long->int instead of directly arriving where it should.
On the other hand, with inline-ASM in the function, the compiler cannot optimise away i into a register (and even if it could, fistp expects a memory variable), so std::lround does not seem too much worse...
The most pressing question I have, however, is how safe it is to assume (as this function does, without any checks) that the rounding mode will always be round-to-nearest. As std::lround has to guarantee a certain behaviour independent of rounding mode, this assumption, as long as it holds, always seems to make the inline-ASM round the better option.
It is furthermore highly unclear to me whether the rounding mode set by std::fesetround (and used by std::lrint, the alternative to std::lround) and the rounding mode employed by the fistp ASM instruction are guaranteed to be the same, or at least synchronous.
These are my considerations, i.e. the things I would need to know to make an informed decision on retaining or replacing the function.
Now to the questions:
Following a more informed view of these considerations or such which I have not thought of, does it seem advisable to use this function?
How great is the risk, if any?
Does reasoning exist for why it would not be faster than std::lround or std::lrint?
Can it be further improved without performance cost?
Does any of this reasoning change if the program were compiled for x86-64?
TL;DR: use lrintf(x) or (int)nearbyintf(x), depending on which one your compiler likes better.
Check the asm to see which one inlines when SSE4.1 is available (e.g. -march=nehalem or penryn, or later), with or without -ffast-math. You may need -fno-math-errno to get GCC to inline sometimes, but clang inlines anyway. That option is 100% safe unless you actually expect lrintf or sqrtf or other math functions to set errno, and it is generally recommended along with -fno-trapping-math.
Don't use inline asm when you can possibly avoid it. Compilers don't "understand" what it does, so they can't optimize through it. E.g. if that function is inlined somewhere that makes its argument a compile-time constant, it will still fld a constant and fistp it to memory, then load that back into an integer register. Pure C will let the compiler propagate the constant and just mov r32, imm32, or further propagate the constant and fold it into something else. Not to mention CSE, and hoisting the conversion out of a loop. (MSVC inline asm doesn't let you specify that an asm block is a pure function that only needs to run if its output value is needed and that doesn't depend on a global. GNU C inline asm does allow that part, but it's still a bad choice for this because it's not transparent to the compiler.)
The GCC wiki even has a page on this subject, explaining the same things as my previous paragraph (and more), so inline asm should definitely be a last resort.
In this case, we can get the compiler to emit good code from pure C, so we should absolutely do that.
Float->int with the current rounding mode only takes a single machine instruction (see below), but the trick is to get a compiler to emit it (and only it). Getting math-library functions to inline can be tricky, because some of them have to set errno and/or raise an inexact exception in certain cases. (-fno-math-errno can help if you can't use the full -ffast-math or the MSVC equivalent.)
With some compilers (gcc but not clang), lrintf is good. It isn't ideal, though: float -> long -> int isn't the same as converting directly to int when long and int are not the same size. The x86-64 System V ABI (used by everything except Windows) has a 64-bit long.
A 64-bit long changes the overflow semantics for lrint: instead of getting 0x80000000 (on x86 with SSE instructions), you'll get the low 32 bits of the long (which will be all-zero if the value was outside the range of a long).
This lrintf won't auto-vectorize (unless maybe the compiler can prove that the floats will be in range), because there are only scalar, not SIMD, instructions to convert floats or doubles to packed 64-bit integers (until AVX512DQ). I don't know of a C math library function that converts directly to int, but you can use (int)nearbyintf(x), which does auto-vectorize more easily in 64-bit code. See the section below for how well gcc and clang do with that.
Other than defeating auto-vectorization, though, there's no direct speed penalty for cvtss2si rax, xmm0 on any modern microarchitecture (see Agner Fog's insn tables). It just costs an extra instruction byte for the REX prefix.
On AArch64 (aka ARM64), gcc4.8 compiles lround into a single fcvtas x0, s0 instruction, so I guess ARM64 provides that funky rounding mode in hardware (but x86 doesn't). Strangely, -ffast-math makes fewer functions inline, but that's with clunky old gcc4.8. For ARM (not 64), gcc4.8 doesn't inline anything, even with -mfloat-abi=hard -mhard-float -march=armv7-a. Maybe those aren't the right options; I don't know ARM very well :/
If you have a lot of conversions to do, you can manually vectorize for x86 with SSE / AVX intrinsics, like _mm_cvtps_epi32 (cvtps2dq), and even pack the resulting 32-bit integer elements down to 16 or 8 bit (with packssdw, then packsswb for 8 bit). However, using pure C that the compiler can auto-vectorize is a good plan, because it's portable.
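A minimal sketch of the manual approach, assuming an x86 target with SSE2 and exactly 16 input floats; the function name round16_to_int8 is made up for illustration. _mm_cvtps_epi32 rounds using the current MXCSR rounding mode (nearest-even by default), and the pack instructions saturate instead of wrapping.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Round 16 floats to the nearest integer and narrow them to int8_t.
   Values outside [-128, 127] saturate to the nearest representable value. */
void round16_to_int8(const float *in, int8_t *out)
{
    assert(in != NULL && out != NULL);

    /* cvtps2dq: four float -> int32 conversions per instruction */
    __m128i a = _mm_cvtps_epi32(_mm_loadu_ps(in));
    __m128i b = _mm_cvtps_epi32(_mm_loadu_ps(in + 4));
    __m128i c = _mm_cvtps_epi32(_mm_loadu_ps(in + 8));
    __m128i d = _mm_cvtps_epi32(_mm_loadu_ps(in + 12));

    /* packssdw: 2 x (4 x int32) -> 8 x int16, element order preserved */
    __m128i ab = _mm_packs_epi32(a, b);
    __m128i cd = _mm_packs_epi32(c, d);

    /* packsswb: 2 x (8 x int16) -> 16 x int8, element order preserved */
    __m128i bytes = _mm_packs_epi16(ab, cd);

    _mm_storeu_si128((__m128i *)out, bytes);
}
```

For the question's value range of [-100, 100] the saturation never triggers, so the 8-bit results match what the scalar function would return.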
lrintf
#include <math.h>
int round_to_nearest(float f) { // default mode is always nearest
return lrintf(f);
}
Compiler output from the Godbolt Compiler explorer:
########### Without -ffast-math #############
cvtss2si eax, xmm0 # gcc 6.1 (-O3 -mx32, so long is 32bit)
cvtss2si rax, xmm0 # gcc 4.4 through 6.1 (-O3). can't auto-vectorize, though.
jmp lrintf # clang 3.8 (-O3 -msse4.1), still tail-calls the function :/
###### With -ffast-math #########
jmp lrintf # clang 3.8 (-O3 -msse4.1 -ffast-math)
So clearly clang doesn't do well with it, but even ancient gcc is great, and does a good job even without -ffast-math.
Don't use roundf / lroundf: their rounding semantics differ from the IEEE default (halfway cases go away from 0, instead of to even). This leads to worse x86 asm, but actually better ARM64 asm. So maybe do use them for ARM? They do have fixed rounding behaviour, though, instead of using the current rounding mode.
If you want the return value as a float, instead of converting to int, it may be better to use nearbyintf. rint has to raise the FP inexact exception when output != input. (But SSE4.1 roundss can implement either behaviour with bit 3 of its immediate control byte.)
Truncating nearbyintf() to int directly
#include <math.h>
int round_to_nearest(float f) {
return nearbyintf(f);
}
Compiler output from the Godbolt Compiler explorer.
######## With -ffast-math ############
cvtss2si eax, xmm0 # gcc 4.8 through 6.1 (-O3 -ffast-math)
# clang is dumb and won't fold the roundss into the cvt. Without sse4.1, it's a function call
roundss xmm0, xmm0, 12 # clang 3.5 to 3.8 (-O3 -ffast-math -msse4.1)
cvttss2si eax, xmm0
roundss xmm1, xmm0, 12 # ICC13 (-O3 -msse4.1 -ffast-math)
cvtss2si eax, xmm1
######## WITHOUT -ffast-math ############
sub rsp, 8
call nearbyintf # gcc 6.1 (-O3 -msse4.1)
add rsp, 8 # and clang without -msse4.1
cvttss2si eax, xmm0
roundss xmm0, xmm0, 12 # clang3.2 and later (-O3 -msse4.1)
cvttss2si eax, xmm0
roundss xmm1, xmm0, 12 # ICC13 (-O3 -msse4.1)
cvtss2si eax, xmm1
gcc 4.7 and earlier: just cvttss2si without -msse4.1, but emits a roundss if SSE4.1 is available. Its nearbyint definition must be using inline asm, because the asm syntax is broken in intel-syntax output. Probably this is how it gets inserted and then not optimized away when gcc realizes it's converting to int.
How it works in asm
Now, quite obviously casting the float to int is not the fastest way to do things, as for that the rounding mode has to be switched to truncate, as per the standard, and back afterwards.
That's only true if you're targeting 20-year-old CPUs without SSE. (You said float, not double, so we only need SSE, not SSE2. The last mainstream CPUs without SSE2 were Athlon XP.)
Modern systems do floating point in xmm registers. SSE has instructions to convert a scalar float to signed int with truncation (cvttss2si) or with the current rounding mode (cvtss2si). (Note the extra t for Truncate in the first one. The rest of the mnemonic is Convert Scalar Single-precision To Signed Integer.) There are similar instructions for double, and x86-64 allows the destination to be a 64-bit integer register.
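Both instructions are reachable from C through the SSE intrinsics, which makes the difference easy to see. This is a sketch for x86 only, with made-up wrapper and demo names, assuming MXCSR still holds the default round-to-nearest-even mode. A plain (int)f cast normally compiles to the truncating form.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* cvttss2si: always truncates toward zero, like a C cast. */
int convert_truncating(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}

/* cvtss2si: rounds according to the current MXCSR rounding mode. */
int convert_current_mode(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* Shows the two conversions side by side under the default mode. */
void demo_conversions(void)
{
    assert(convert_truncating(2.7f) == 2);
    assert(convert_current_mode(2.7f) == 3);

    assert(convert_truncating(-2.7f) == -2);
    assert(convert_current_mode(-2.7f) == -3);

    /* Halfway cases go to the even neighbour. */
    assert(convert_current_mode(2.5f) == 2);
    assert(convert_current_mode(3.5f) == 4);
}
```

The intrinsics tie the code to x86, so lrintf or (int)nearbyintf remain the portable spelling; this is only meant to show which instruction does what.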
See also the x86 tag wiki.
cvttss2si basically exists because of C's default behaviour for casting float to int. Changing the rounding mode is slow, so Intel provided a way to do it that doesn't suck.
I think even 32-bit versions of modern Windows require hardware new enough to have SSE2, in case that matters to anyone. (SSE2 is part of the AMD64 ISA, and the 64-bit calling conventions even pass float / double args in xmm registers.)