是GNU gprof越野车吗? [英] Is GNU gprof buggy?
问题描述
我有一个C程序,通过函数pi_calcPiBlock
调用函数pi_calcPiItem()
600000000次.因此,为了分析在功能上花费的时间,我使用了GNU gprof.结果似乎是错误的,因为所有调用都归属于main()
.此外,调用图没有任何意义:
I have a C program that calls a function pi_calcPiItem()
600000000 times through the function pi_calcPiBlock
. So to analyze the time spent in the functions I used GNU gprof. The result seems to be erroneous since all calls are attributed to main()
instead. Furthermore the call graph does not make any sense:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
61.29 9.28 9.28 pi_calcPiItem
15.85 11.68 2.40 pi_calcPiBlock
11.96 13.49 1.81 _mcount_private
9.45 14.92 1.43 __fentry__
1.45 15.14 0.22 pow
0.00 15.14 0.00 600000000 0.00 0.00 main
Call graph
granularity: each sample hit covers 4 byte(s) for 0.07% of 15.14 seconds
index % time self children called name
<spontaneous>
[1] 61.3 9.28 0.00 pi_calcPiItem [1]
-----------------------------------------------
<spontaneous>
[2] 15.9 2.40 0.00 pi_calcPiBlock [2]
0.00 0.00 600000000/600000000 main [6]
-----------------------------------------------
<spontaneous>
[3] 12.0 1.81 0.00 _mcount_private [3]
-----------------------------------------------
<spontaneous>
[4] 9.4 1.43 0.00 __fentry__ [4]
-----------------------------------------------
<spontaneous>
[5] 1.5 0.22 0.00 pow [5]
-----------------------------------------------
6 main [6]
0.00 0.00 600000000/600000000 pi_calcPiBlock [2]
[6] 0.0 0.00 0.00 600000000+6 main [6]
6 main [6]
-----------------------------------------------
这是一个错误还是我必须以某种方式配置该程序?
Is this a bug or do I have to configure the program somehow?
<spontaneous>
是什么意思?
编辑(为您提供更多信息)
EDIT (more insight for you)
代码全部与pi的计算有关:
The code is all about the calculation of pi:
#define PI_BLOCKSIZE (100000000)
#define PI_BLOCKCOUNT (6)
#define PI_THRESHOLD (PI_BLOCKSIZE * PI_BLOCKCOUNT)
int32_t main(int32_t argc, char* argv[]) {
double result;
for ( int32_t i = 0; i < PI_THRESHOLD; i += PI_BLOCKSIZE ) {
pi_calcPiBlock(&result, i, i + PI_BLOCKSIZE);
}
printf("pi = %f\n",result);
return 0;
}
static void pi_calcPiBlock(double* result, int32_t start, int32_t end) {
double piItem;
for ( int32_t i = start; i < end; ++i ) {
pi_calcPiItem(&piItem, i);
*result += piItem;
}
}
static void pi_calcPiItem(double* piItem, int32_t index) {
*piItem = 4.0 * (pow(-1.0,index) / (2.0 * index + 1.0));
}
这就是我得到结果的方式(在Windows的Cygwin的帮助下执行):
And this is how I got the results (executed on Windows with the help of Cygwin):
> gcc -std=c99 -o pi *.c -pg -fno-inline-small-functions
> ./pi.exe
> gprof.exe pi.exe
推荐答案
尝试:
- 使用
noinline
,noclone
函数属性而不是-fno-inline-small-functions
- 通过拆卸
main
,我可以看到-fno-inline-small-functions
不会停止内联
- 通过拆卸
- Using the
noinline
,noclone
function attributes instead of-fno-inline-small-functions
- By disassembling
main
I could see that-fno-inline-small-functions
doesn't stop inlining
- By disassembling
这在Linux x86-64上对我有用:
This worked for me on Linux, x86-64:
#include <stdio.h>
#include <stdint.h>
#include <math.h>
#define PI_BLOCKSIZE (100000000)
#define PI_BLOCKCOUNT (6)
#define PI_THRESHOLD (PI_BLOCKSIZE * PI_BLOCKCOUNT)
static void pi_calcPiItem(double* piItem, int32_t index);
static void pi_calcPiBlock(double* result, int32_t start, int32_t end);
int32_t main(int32_t argc, char* argv[]) {
double result;
result = 0.0;
for ( int32_t i = 0; i < PI_THRESHOLD; i += PI_BLOCKSIZE ) {
pi_calcPiBlock(&result, i, i + PI_BLOCKSIZE);
}
printf("pi = %f\n",result);
return 0;
}
__attribute__((noinline, noclone))
static void pi_calcPiBlock(double* result, int32_t start, int32_t end) {
double piItem;
for ( int32_t i = start; i < end; ++i ) {
pi_calcPiItem(&piItem, i);
*result += piItem;
}
}
__attribute__((noinline, noclone))
static void pi_calcPiItem(double* piItem, int32_t index) {
*piItem = 4.0 * (pow(-1.0,index) / (2.0 * index + 1.0));
}
构建代码
$ cc pi.c -o pi -Os -Wall -g3 -I. -std=c99 -pg -static -lm
输出
$ ./pi && gprof ./pi
pi = 3.141593
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
85.61 22.55 22.55 __ieee754_pow_sse2
4.75 23.80 1.25 pow
4.14 24.89 1.09 600000000 1.82 1.82 pi_calcPiItem
2.54 25.56 0.67 __exp1
0.91 25.80 0.24 pi_calcPiBlock
0.53 25.94 0.14 matherr
0.47 26.07 0.13 __lseek_nocancel
0.38 26.17 0.10 frame_dummy
0.34 26.26 0.09 __ieee754_exp_sse2
0.32 26.34 0.09 __profile_frequency
0.00 26.34 0.00 1 0.00 0.00 main
Call graph (explanation follows)
granularity: each sample hit covers 2 byte(s) for 0.04% of 26.34 seconds
index % time self children called name
<spontaneous>
[1] 85.6 22.55 0.00 __ieee754_pow_sse2 [1]
-----------------------------------------------
<spontaneous>
[2] 5.0 0.24 1.09 pi_calcPiBlock [2]
1.09 0.00 600000000/600000000 pi_calcPiItem [4]
-----------------------------------------------
<spontaneous>
[3] 4.7 1.25 0.00 pow [3]
-----------------------------------------------
1.09 0.00 600000000/600000000 pi_calcPiBlock [2]
[4] 4.1 1.09 0.00 600000000 pi_calcPiItem [4]
-----------------------------------------------
<spontaneous>
[5] 2.5 0.67 0.00 __exp1 [5]
-----------------------------------------------
<spontaneous>
[6] 0.5 0.14 0.00 matherr [6]
-----------------------------------------------
<spontaneous>
[7] 0.5 0.13 0.00 __lseek_nocancel [7]
-----------------------------------------------
<spontaneous>
[8] 0.4 0.10 0.00 frame_dummy [8]
-----------------------------------------------
<spontaneous>
[9] 0.3 0.09 0.00 __ieee754_exp_sse2 [9]
-----------------------------------------------
<spontaneous>
[10] 0.3 0.09 0.00 __profile_frequency [10]
-----------------------------------------------
0.00 0.00 1/1 __libc_start_main [827]
[11] 0.0 0.00 0.00 1 main [11]
-----------------------------------------------
评论
如预期的那样,pow()
是瓶颈.当pi
运行时,perf top
(基于采样的系统分析器)还显示__ieee754_pow_sse2
占用了60%以上的CPU.按照@Mike Dunlavey的建议将pow(-1.0,index)
更改为((i & 1) ? -1.0 : 1.0)
,可使代码快4倍左右.
As expected, pow()
is the bottleneck. While pi
is running, perf top
(sampling based system profiler) also shows __ieee754_pow_sse2
taking 60%+ of CPU. Changing pow(-1.0,index)
to ((i & 1) ? -1.0 : 1.0)
as @Mike Dunlavey suggested makes the code roughly 4 times faster.
这篇关于是GNU gprof越野车吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!