Why does the compiler generate additional sqrts in the compiled assembly code?
Problem description
I'm trying to profile the time it takes to compute a sqrt using the following simple C code, where readTSC() is a function that reads the CPU's cycle counter.

    double sum = 0.0;
    int i;
    tm = readTSC();
    for ( i = 0; i < n; i++ )
        sum += sqrt((double) i);
    tm = readTSC() - tm;
    printf("%lld clocks in total\n", tm);
    printf("%15.6e\n", sum);

However, when I printed out the assembly code using

    gcc -S timing.c -o timing.s

on an Intel machine, the result (shown below) was surprising.

Why are there two sqrts in the assembly code, one using the sqrtsd instruction and the other using a function call? Is it related to loop unrolling and trying to execute two sqrts in one iteration?

And how should I understand the line

    ucomisd %xmm0, %xmm0

Why does it compare %xmm0 to itself?

    //----------------start of for loop----------------
    call    readTSC
    movq    %rax, -32(%rbp)
    movl    $0, -4(%rbp)
    jmp     .L4
    .L6:
    cvtsi2sd -4(%rbp), %xmm1
    // 1. use sqrtsd instruction
    sqrtsd  %xmm1, %xmm0
    ucomisd %xmm0, %xmm0
    jp      .L8
    je      .L5
    .L8:
    movapd  %xmm1, %xmm0
    // 2. use C function call
    call    sqrt
    .L5:
    movsd   -16(%rbp), %xmm1
    addsd   %xmm1, %xmm0
    movsd   %xmm0, -16(%rbp)
    addl    $1, -4(%rbp)
    .L4:
    movl    -4(%rbp), %eax
    cmpl    -36(%rbp), %eax
    jl      .L6
    //----------------end of for loop----------------
    call    readTSC
Solution

It's using the library sqrt function for error handling. As an optimization, it first tries to perform the square root with the inlined sqrtsd instruction, then checks the result against itself using the ucomisd instruction, which sets the flags as follows:

    CASE (RESULT) OF
        UNORDERED:    ZF,PF,CF 111;
        GREATER_THAN: ZF,PF,CF 000;
        LESS_THAN:    ZF,PF,CF 001;
        EQUAL:        ZF,PF,CF 100;
    ESAC;

In particular, comparing a QNaN to itself will return UNORDERED, which is what you will get if you try to take the square root of a negative number. This is covered by the jp branch: when it is taken, the original argument is copied back into %xmm0 (the movapd) and the library sqrt is called so that it can do the error reporting. The je check is just paranoia, checking for exact equality; once UNORDERED has been ruled out, a value always compares equal to itself, so that branch is always taken and the call is skipped.

Also note that gcc has a -ffast-math option which will sacrifice this error handling for speed.