sqrt of uint64_t vs. int64_t
I noticed that calculating the integer part of the square root of a uint64_t is much more complicated than for an int64_t. Does anybody have an explanation for this? Why is it seemingly so much more difficult to deal with one extra bit?
The following:
#include <cmath>
#include <cstdint>
int64_t sqrt_int(int64_t a) {
    return sqrt(a);
}
compiles with clang 5.0 and -mfpmath=sse -msse3 -Wall -O3 to:
sqrt_int(long):                         # @sqrt_int(long)
        cvtsi2sd xmm0, rdi
        sqrtsd xmm0, xmm0
        cvttsd2si rax, xmm0
        ret
But the following:
uint64_t sqrt_int(uint64_t a) {
    return sqrt(a);
}
compiles to:
.LCPI0_0:
        .long 1127219200                # 0x43300000
        .long 1160773632                # 0x45300000
        .long 0                         # 0x0
        .long 0                         # 0x0
.LCPI0_1:
        .quad 4841369599423283200       # double 4503599627370496
        .quad 4985484787499139072       # double 1.9342813113834067E+25
.LCPI0_2:
        .quad 4890909195324358656       # double 9.2233720368547758E+18
sqrt_int(unsigned long):                # @sqrt_int(unsigned long)
        movq xmm0, rdi
        punpckldq xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
        subpd xmm0, xmmword ptr [rip + .LCPI0_1]
        haddpd xmm0, xmm0
        sqrtsd xmm0, xmm0
        movsd xmm1, qword ptr [rip + .LCPI0_2] # xmm1 = mem[0],zero
        movapd xmm2, xmm0
        subsd xmm2, xmm1
        cvttsd2si rax, xmm2
        movabs rcx, -9223372036854775808
        xor rcx, rax
        cvttsd2si rax, xmm0
        ucomisd xmm0, xmm1
        cmovae rax, rcx
        ret
First, be clear about what this code does: it converts a 64-bit integer (signed or unsigned) to double-precision floating point, takes the square root, and then casts the result back to a signed or unsigned integer.
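To make those three steps visible, here is a minimal sketch that writes out the implicit conversions as explicit casts. The function names sqrt_int_signed and sqrt_int_unsigned are made up for illustration, and the comments naming instructions refer to the signed listing in the question. Inputs are assumed to be non-negative.

```cpp
#include <cmath>
#include <cstdint>

// Signed variant (hypothetical name): on x86-64 with SSE2, each step
// corresponds to one instruction in the listing from the question.
std::int64_t sqrt_int_signed(std::int64_t a) {
    double d = static_cast<double>(a);    // integer -> double (cvtsi2sd)
    double r = std::sqrt(d);              // square root in double precision (sqrtsd)
    return static_cast<std::int64_t>(r);  // double -> integer, truncating (cvttsd2si)
}

// Unsigned variant (hypothetical name): the same three steps; only the two
// conversions differ, and those are what the longer listing implements.
std::uint64_t sqrt_int_unsigned(std::uint64_t a) {
    double d = static_cast<double>(a);    // unsigned integer -> double
    double r = std::sqrt(d);              // same sqrtsd as above
    return static_cast<std::uint64_t>(r); // double -> unsigned integer, truncating
}
```

The square root itself is identical in both functions; all of the extra code in the unsigned output comes from the first and last lines.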
The reason for the difference is that the instruction set you are compiling for provides a 64-bit signed integer to double-precision conversion (cvtsi2sd) and the reverse (cvttsd2si), but has no equivalents for the unsigned case. Intel added unsigned conversion instructions in AVX-512, but they do not exist prior to that. So for the signed case, the conversion to double precision and the conversion back are one instruction each. For the unsigned case, the compiler has to synthesize both conversions from the available instructions.
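As an illustration of what "synthesizing" means here, the following sketch mirrors the unsigned listing in portable C++. It is a reading of the constants in that listing (0x43300000 and 0x45300000 as exponent fields, 2^52 and 2^84 as biases, 2^63 as the threshold), not clang's actual source. The helper names u64_to_double and double_to_u64 are hypothetical, and double_to_u64 assumes its argument is in the range [0, 2^64).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: unsigned 64-bit integer -> double using only
// bit manipulation and double arithmetic (no unsigned conversion instruction).
double u64_to_double(std::uint64_t x) {
    // Put each 32-bit half into the mantissa of a double with a fixed exponent.
    std::uint64_t lo_bits = 0x4330000000000000ULL | (x & 0xFFFFFFFFULL); // 2^52 + lo
    std::uint64_t hi_bits = 0x4530000000000000ULL | (x >> 32);           // 2^84 + hi * 2^32
    double lo, hi;
    std::memcpy(&lo, &lo_bits, sizeof lo);
    std::memcpy(&hi, &hi_bits, sizeof hi);
    // Subtracting the biases is exact (subpd in the listing); the final add
    // (haddpd) is the only step that rounds.
    return (lo - 4503599627370496.0)            // 2^52
         + (hi - 19342813113834066795298816.0); // 2^84
}

// Hypothetical helper: double -> unsigned 64-bit integer using only the
// signed truncating conversion. Assumes 0 <= d < 2^64.
std::uint64_t double_to_u64(double d) {
    const double two63 = 9223372036854775808.0; // 2^63
    if (d >= two63) {
        // Too large for a signed conversion: shift down by 2^63, convert,
        // then restore the top bit (subsd, cvttsd2si, xor in the listing).
        std::int64_t shifted = static_cast<std::int64_t>(d - two63);
        return static_cast<std::uint64_t>(shifted) ^ 0x8000000000000000ULL;
    }
    // Fits in the signed range: a plain signed conversion is enough.
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(d));
}
```

The compiled code computes both candidate results and selects one with cmovae instead of branching, but the arithmetic is the same as in this sketch.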
You can look up which instructions are available in which extension (SSE2, AVX, AVX-512, etc.) in the Intel Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
The method used to synthesize the conversion is discussed in the question "Are there unsigned equivalents of the x87 FILD and SSE CVTSI2SD instructions?"