How 'smart' is GCC's Tail-Call-Optimisation?

Question:
I just had a discussion in which the following two pieces of C code came up:
For-Loop:
#include <stdio.h>
#define n (196607)
int main() {
long loop;
int count=0;
for (loop=0;loop<n;loop++) {
count++;
}
printf("Result = %d\n",count);
return 0;
}
Recursive:
#include <stdio.h>
#define n (196607)
long recursive(long loop) {
return (loop>0) ? recursive(loop-1)+1: 0;
}
int main() {
long result;
result = recursive(n);
printf("Result = %ld\n",result);
return 0;
}
On seeing this code, I saw recursive(loop-1)+1
and thought "ah, that's not tail call recursive" because it has work to do after the call to recursive
is complete; it needs to increment the return value.
Sure enough, with no optimisation, the recursive code triggers a stack overflow, as you would expect.
With the -O2
flag, however, no stack overflow occurs, which I take to mean that the stack frame is reused rather than more and more frames being pushed onto the stack - which is TCO.
GCC can obviously detect this simple case (+1 to return value) and optimise it out, but how far does it go?
What are the limits to what gcc can optimise with tco, when the recursive call isn't the last operation to be performed?
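One way to see a limit (my own example, not from the question): GCC can fold a simple pending operation like the +1 above into an accumulator, but when the result feeds more than one recursive call, as in naive Fibonacci, there is no single tail call left to optimise away, and the recursion generally cannot be reduced to a loop:

```c
/* Tree recursion: the results of TWO recursive calls are combined,
 * so at most one of them could be a tail call. GCC generally cannot
 * turn this into a simple loop the way it can with recursive()+1. */
long fib(long n) {
    return (n < 2) ? n : fib(n - 1) + fib(n - 2);
}
```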
Addendum:
I've written a fully tail-recursive (return function(); style) version of the code.
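That tail-recursive version isn't shown above; a sketch of what it presumably looks like (an accumulator-passing form; the function names and the wrapper are my guesses, not the original code):

```c
/* Accumulator-passing form: the pending "+1" from the original
 * recursive(loop-1)+1 is carried in an extra argument, so the
 * recursive call is the very last thing the function does -
 * a true tail call even before any compiler transformation. */
long tail_recursive_helper(long loop, long acc) {
    return (loop > 0) ? tail_recursive_helper(loop - 1, acc + 1) : acc;
}

long tail_recursive(long loop) {
    return tail_recursive_helper(loop, 0);
}
```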
Wrapping the above in a loop with 9999999 iterations, I came up with the following timings:
$ for f in *.exe; do time ./$f > results; done
+ for f in '*.exe'
+ ./forLoop.c.exe
real 0m3.650s
user 0m3.588s
sys 0m0.061s
+ for f in '*.exe'
+ ./recursive.c.exe
real 0m3.682s
user 0m3.588s
sys 0m0.093s
+ for f in '*.exe'
+ ./tail_recursive.c.exe
real 0m3.697s
user 0m3.588s
sys 0m0.077s
So an (admittedly simple and not very rigorous) benchmark shows that all three versions do indeed take the same order of time.
Solution:

Just disassemble the code and see what happened. Without optimizations, I get this:
0x0040150B cmpl $0x0,0x10(%rbp)
0x0040150F jle 0x401523 <recursive+35>
0x00401511 mov 0x10(%rbp),%eax
0x00401514 sub $0x1,%eax
0x00401517 mov %eax,%ecx
0x00401519 callq 0x401500 <recursive>
But with -O1, -O2 or -O3 I get this:
0x00402D09 mov $0x2ffff,%edx
This has nothing to do with tail-call optimization, but with a much more radical optimization: the compiler simply inlined the whole function and pre-calculated the result (0x2ffff is 196607, the value of n).
That is most likely why all your different benchmark cases end up with roughly the same timings.