Why does the compiler generate additional sqrts in the compiled assembly code


Question

I'm trying to profile the time it takes to compute a sqrt using the following simple C code, where readTSC() is a function to read the CPU's cycle counter.

double sum = 0.0;
int i;
tm = readTSC();
for ( i = 0; i < n; i++ )
   sum += sqrt((double) i);
tm = readTSC() - tm;
printf("%lld clocks in total\n",tm);
printf("%15.6e\n",sum);

However, as I printed out the assembly code using

gcc -S timing.c -o timing.s

on an Intel machine, the result (shown below) was surprising.

Why are there two sqrts in the assembly code, one using the sqrtsd instruction and the other using a function call? Is it related to loop unrolling, with the compiler trying to execute two sqrts in one iteration?

And how should I understand the line

ucomisd %xmm0, %xmm0

Why does it compare %xmm0 to itself?

//----------------start of for loop----------------
call    readTSC
movq    %rax, -32(%rbp)
movl    $0, -4(%rbp)
jmp .L4
.L6:
cvtsi2sd    -4(%rbp), %xmm1
// 1. use sqrtsd instruction
sqrtsd  %xmm1, %xmm0
ucomisd %xmm0, %xmm0
jp  .L8
je  .L5
.L8:
movapd  %xmm1, %xmm0
// 2. use C function call
call    sqrt
.L5:
movsd   -16(%rbp), %xmm1
addsd   %xmm1, %xmm0
movsd   %xmm0, -16(%rbp)
addl    $1, -4(%rbp)
.L4:
movl    -4(%rbp), %eax
cmpl    -36(%rbp), %eax
jl  .L6
//----------------end of for loop----------------
call    readTSC

Solution

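It's using the library sqrt function for error handling: the library call can report invalid input, such as a negative argument, through errno, which the bare sqrtsd instruction cannot do. As an optimization, it first tries to perform the square root with the inlined sqrtsd instruction, then checks the result against itself using the ucomisd instruction, which sets the flags as follows: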

CASE (RESULT) OF
   UNORDERED:    ZF,PF,CF  111;
   GREATER_THAN: ZF,PF,CF  000;
   LESS_THAN:    ZF,PF,CF  001;
   EQUAL:        ZF,PF,CF  100;
ESAC;
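
To make the table concrete, here is a minimal C sketch that models the flag results and the jp/je sequence from the listing in the question. The names cmp_flags, model_ucomisd, jp_taken, je_taken and reaches_library_call are hypothetical helpers invented for this illustration; they are not part of any library, and the model only covers the three flags shown in the table.

```c
#include <assert.h>
#include <stdbool.h>

/* The three flags from the table, in the same order: ZF, PF, CF. */
struct cmp_flags {
    bool zf, pf, cf;
};

/* Hypothetical model of ucomisd: compares dst with src and returns the
   flag pattern given in the table. */
struct cmp_flags model_ucomisd(double dst, double src)
{
    if (dst != dst || src != src)                          /* an operand is NaN */
        return (struct cmp_flags){ true, true, true };     /* UNORDERED */
    if (dst > src)
        return (struct cmp_flags){ false, false, false };  /* GREATER_THAN */
    if (dst < src)
        return (struct cmp_flags){ false, false, true };   /* LESS_THAN */
    return (struct cmp_flags){ true, false, false };       /* EQUAL */
}

/* jp jumps when PF is set; je jumps when ZF is set. */
bool jp_taken(struct cmp_flags f) { return f.pf; }
bool je_taken(struct cmp_flags f) { return f.zf; }

/* Mirrors "ucomisd %xmm0, %xmm0; jp .L8; je .L5".
   Returns true when control ends up at .L8, where sqrt is called. */
bool reaches_library_call(double result)
{
    struct cmp_flags f = model_ucomisd(result, result);
    if (jp_taken(f))
        return true;     /* jp .L8: the result is NaN */
    if (je_taken(f))
        return false;    /* je .L5: skip the library call */
    return true;         /* fall through into .L8 */
}

/* Small self-check showing both outcomes of the self-comparison. */
void check_self_compare_examples(void)
{
    volatile double zero = 0.0;
    double nan_value = zero / zero;   /* produces NaN at run time */

    assert(!reaches_library_call(3.0));
    assert(reaches_library_call(nan_value));
}
```

When a value is compared with itself, only the UNORDERED and EQUAL rows can occur, so in this model the final fall-through in reaches_library_call is never reached.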

In particular, taking the square root of a negative number yields a QNaN, and comparing a QNaN to itself returns UNORDERED. This case is covered by the jp branch, which leads to the library call. The je check is just paranoia: once the unordered case has been ruled out, a value always compares equal to itself, so je is always taken.
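
The fast-path-then-fallback shape described in this answer can be sketched in C as follows. This is an illustration under stated assumptions, not what the compiler literally emits: fast_path_stub and library_stub are hypothetical stand-ins for the sqrtsd instruction and the library sqrt call, so that the example does not depend on linking the math library. The stub sets errno to EDOM for a negative argument, which is how a library sqrt reports a domain error on implementations that use errno for math errors (glibc is one).

```c
#include <assert.h>
#include <errno.h>
#include <math.h>   /* only for the NAN macro; no libm function is called */

typedef double (*sqrt_fn)(double);

/* Counts how often the slow path runs, so the control flow is observable. */
int library_calls = 0;

/* Hypothetical stand-in for the inline sqrtsd: NaN for negative input,
   otherwise an approximate root from a fixed number of Newton steps. */
double fast_path_stub(double x)
{
    if (x < 0.0)
        return NAN;
    if (x == 0.0)
        return 0.0;
    double r = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 40; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Hypothetical stand-in for the library sqrt: reports a domain error
   through errno, which the inline instruction cannot do. */
double library_stub(double x)
{
    library_calls++;
    if (x < 0.0) {
        errno = EDOM;
        return NAN;
    }
    return fast_path_stub(x);
}

/* Same shape as the loop body in the assembly listing. */
double sqrt_with_fallback(double x, sqrt_fn fast, sqrt_fn library)
{
    double r = fast(x);     /* sqrtsd  %xmm1, %xmm0 */
    if (r != r)             /* ucomisd %xmm0, %xmm0 ; jp .L8 */
        r = library(x);     /* movapd  %xmm1, %xmm0 ; call sqrt */
    return r;               /* .L5: result is in %xmm0 */
}

/* Small self-check: the slow path runs only for the invalid input. */
void demo_fallback(void)
{
    errno = 0;
    library_calls = 0;

    double ok = sqrt_with_fallback(9.0, fast_path_stub, library_stub);
    assert(ok > 2.999 && ok < 3.001);
    assert(library_calls == 0 && errno == 0);

    double bad = sqrt_with_fallback(-1.0, fast_path_stub, library_stub);
    assert(bad != bad);     /* still NaN, but errno is now set */
    assert(library_calls == 1 && errno == EDOM);
}
```

The point of the design is that valid inputs pay only for the inline instruction plus one comparison, while the function call is reserved for the rare case in which an error has to be reported.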

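Also note that gcc has a -ffast-math option, which sacrifices this error handling for speed. The narrower -fno-math-errno option, which -ffast-math implies, should be enough to drop the fallback call.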
