Is GNU gprof buggy?

Question

I have a C program that calls the function pi_calcPiItem() 600000000 times through the function pi_calcPiBlock(). To analyze the time spent in each function, I used GNU gprof. The result seems to be wrong, since all calls are attributed to main() instead. Furthermore, the call graph does not make any sense:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 61.29      9.28     9.28                             pi_calcPiItem
 15.85     11.68     2.40                             pi_calcPiBlock
 11.96     13.49     1.81                             _mcount_private
  9.45     14.92     1.43                             __fentry__
  1.45     15.14     0.22                             pow
  0.00     15.14     0.00 600000000     0.00     0.00  main

                        Call graph


granularity: each sample hit covers 4 byte(s) for 0.07% of 15.14 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     61.3    9.28    0.00                 pi_calcPiItem [1]
-----------------------------------------------
                                                 <spontaneous>
[2]     15.9    2.40    0.00                 pi_calcPiBlock [2]
                0.00    0.00 600000000/600000000     main [6]
-----------------------------------------------
                                                 <spontaneous>
[3]     12.0    1.81    0.00                 _mcount_private [3]
-----------------------------------------------
                                                 <spontaneous>
[4]      9.4    1.43    0.00                 __fentry__ [4]
-----------------------------------------------
                                                 <spontaneous>
[5]      1.5    0.22    0.00                 pow [5]
-----------------------------------------------
                                   6             main [6]
                0.00    0.00 600000000/600000000     pi_calcPiBlock [2]
[6]      0.0    0.00    0.00 600000000+6       main [6]
                                   6             main [6]
-----------------------------------------------

Is this a bug or do I have to configure the program somehow?

What does <spontaneous> mean?

EDIT (more insight for you)

The code is all about the calculation of pi:

#define PI_BLOCKSIZE (100000000)
#define PI_BLOCKCOUNT (6)
#define PI_THRESHOLD (PI_BLOCKSIZE * PI_BLOCKCOUNT)

int32_t main(int32_t argc, char* argv[]) {
  double result;

  for ( int32_t i = 0; i < PI_THRESHOLD; i += PI_BLOCKSIZE ) {
    pi_calcPiBlock(&result, i, i + PI_BLOCKSIZE);
  }

  printf("pi = %f\n",result);
  return 0;
}

static void pi_calcPiBlock(double* result, int32_t start, int32_t end) {
  double piItem;

  for ( int32_t i = start; i < end; ++i ) {
    pi_calcPiItem(&piItem, i);
    *result += piItem;
  }  
}    

static void pi_calcPiItem(double* piItem, int32_t index) {
  *piItem = 4.0 * (pow(-1.0,index) / (2.0 * index + 1.0));
}

And this is how I got the results (executed on Windows with the help of Cygwin):

> gcc -std=c99 -o pi *.c -pg -fno-inline-small-functions
> ./pi.exe
> gprof.exe pi.exe

Recommended answer

Try:

  1. Using the noinline, noclone function attributes instead of -fno-inline-small-functions
    • By disassembling main I could see that -fno-inline-small-functions doesn't stop inlining

This worked for me on Linux, x86-64:

#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define PI_BLOCKSIZE (100000000)
#define PI_BLOCKCOUNT (6)
#define PI_THRESHOLD (PI_BLOCKSIZE * PI_BLOCKCOUNT)

static void pi_calcPiItem(double* piItem, int32_t index);
static void pi_calcPiBlock(double* result, int32_t start, int32_t end);

int32_t main(int32_t argc, char* argv[]) {
  double result;

  result = 0.0;
  for ( int32_t i = 0; i < PI_THRESHOLD; i += PI_BLOCKSIZE ) {
    pi_calcPiBlock(&result, i, i + PI_BLOCKSIZE);
  }

  printf("pi = %f\n",result);
  return 0;
}

__attribute__((noinline, noclone))
static void pi_calcPiBlock(double* result, int32_t start, int32_t end) {
  double piItem;

  for ( int32_t i = start; i < end; ++i ) {
    pi_calcPiItem(&piItem, i);
    *result += piItem;
  }  
}    

__attribute__((noinline, noclone))
static void pi_calcPiItem(double* piItem, int32_t index) {
  *piItem = 4.0 * (pow(-1.0,index) / (2.0 * index + 1.0));
}

Building the code

$ cc pi.c -o pi -Os -Wall -g3 -I. -std=c99 -pg -static -lm

Output

$ ./pi && gprof ./pi
pi = 3.141593
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 85.61     22.55    22.55                             __ieee754_pow_sse2
  4.75     23.80     1.25                             pow
  4.14     24.89     1.09 600000000     1.82     1.82  pi_calcPiItem
  2.54     25.56     0.67                             __exp1
  0.91     25.80     0.24                             pi_calcPiBlock
  0.53     25.94     0.14                             matherr
  0.47     26.07     0.13                             __lseek_nocancel
  0.38     26.17     0.10                             frame_dummy
  0.34     26.26     0.09                             __ieee754_exp_sse2
  0.32     26.34     0.09                             __profile_frequency
  0.00     26.34     0.00        1     0.00     0.00  main


             Call graph (explanation follows)


granularity: each sample hit covers 2 byte(s) for 0.04% of 26.34 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     85.6   22.55    0.00                 __ieee754_pow_sse2 [1]
-----------------------------------------------
                                                 <spontaneous>
[2]      5.0    0.24    1.09                 pi_calcPiBlock [2]
                1.09    0.00 600000000/600000000     pi_calcPiItem [4]
-----------------------------------------------
                                                 <spontaneous>
[3]      4.7    1.25    0.00                 pow [3]
-----------------------------------------------
                1.09    0.00 600000000/600000000     pi_calcPiBlock [2]
[4]      4.1    1.09    0.00 600000000         pi_calcPiItem [4]
-----------------------------------------------
                                                 <spontaneous>
[5]      2.5    0.67    0.00                 __exp1 [5]
-----------------------------------------------
                                                 <spontaneous>
[6]      0.5    0.14    0.00                 matherr [6]
-----------------------------------------------
                                                 <spontaneous>
[7]      0.5    0.13    0.00                 __lseek_nocancel [7]
-----------------------------------------------
                                                 <spontaneous>
[8]      0.4    0.10    0.00                 frame_dummy [8]
-----------------------------------------------
                                                 <spontaneous>
[9]      0.3    0.09    0.00                 __ieee754_exp_sse2 [9]
-----------------------------------------------
                                                 <spontaneous>
[10]     0.3    0.09    0.00                 __profile_frequency [10]
-----------------------------------------------
                0.00    0.00       1/1           __libc_start_main [827]
[11]     0.0    0.00    0.00       1         main [11]
-----------------------------------------------

Comments

As expected, pow() is the bottleneck. While pi is running, perf top (a sampling-based system profiler) also shows __ieee754_pow_sse2 taking over 60% of the CPU. Changing pow(-1.0,index) to ((index & 1) ? -1.0 : 1.0), as @Mike Dunlavey suggested, makes the code roughly 4 times faster.
