Why is the gcc math library so inefficient?


Question

When I was porting some Fortran code to C, I was surprised to find that most of the execution-time discrepancy between the Fortran program compiled with ifort (the Intel Fortran compiler) and the C program compiled with gcc comes from evaluating trigonometric functions (sin, cos). This surprised me because I used to believe what this answer explains: that functions like sine and cosine are implemented in microcode inside the microprocessor.

To pin down the problem more explicitly, I wrote a small test program in Fortran:

program ftest
  implicit none
  real(8) :: x
  integer :: i
  x = 0d0
  do i = 1, 10000000
    x = cos (2d0 * x)
  end do
  write (*,*) x
end program ftest

On an Intel Q6600 processor running 3.6.9-1-ARCH x86_64 Linux, with ifort version 12.1.0 I get

$ ifort -o ftest ftest.f90 
$ time ./ftest
  -0.211417093282753     

real    0m0.280s
user    0m0.273s
sys     0m0.003s

With gcc version 4.7.2 I get

$ gfortran -o ftest ftest.f90 
$ time ./ftest
  0.16184945593939115     

real    0m2.148s
user    0m2.090s
sys     0m0.003s

This is almost a factor of 10 difference! Can I still believe that the gcc implementation of cos is a wrapper around the microprocessor implementation, similar to what the Intel implementation probably does? If so, where is the bottleneck?

Edit

According to the comments, enabling optimizations should improve performance. My opinion was that optimizations do not affect library functions ... which does not mean that I don't use them in nontrivial programs. However, here are two additional benchmarks (now on my home computer, an Intel Core 2):

$ gfortran -o ftest ftest.f90
$ time ./ftest
  0.16184945593939115     

real    0m2.993s
user    0m2.986s
sys     0m0.000s

$ gfortran -Ofast -march=native -o ftest ftest.f90
$ time ./ftest
  0.16184945593939115     

real    0m2.967s
user    0m2.960s
sys     0m0.003s

Which particular optimizations did you (commenters) have in mind? And how can a compiler exploit a multi-core processor in this particular example, where each iteration depends on the result of the previous one?

Edit 2

The benchmarks by Daniel Fisher and Ilmari Karonen made me think that the problem might be related to the particular version of gcc (4.7.2), and maybe to the particular build of it (Arch x86_64 Linux), that I am using on my computers. So I repeated the test on an Intel Core i7 box with Debian x86_64 Linux, gcc version 4.4.5 and ifort version 12.1.0:

$ gfortran -O3 -o ftest ftest.f90
$ time ./ftest
  0.16184945593939115     

real    0m0.272s
user    0m0.268s
sys     0m0.004s

$ ifort -O3 -o ftest ftest.f90
$ time ./ftest
  -0.211417093282753     

real    0m0.178s
user    0m0.176s
sys     0m0.004s

For me this is a perfectly acceptable performance difference, one that would never have made me ask this question. It seems I will have to ask about this issue on the Arch Linux forums.

However, an explanation of the whole story is still very welcome.

Recommended answer

Most of this is due to differences in the math library. Some points to consider:

  • Yes, x86 processors with the x87 unit have fsin and fcos instructions. However, they are implemented in microcode, and there is no particular reason why they must be faster than a pure software implementation.
  • GCC does not have its own math library, but rather uses the one provided by the system. On Linux this is typically provided by glibc.
  • 32-bit x86 glibc uses fsin/fcos.
  • x86_64 glibc uses software implementations that run on the SSE2 unit. For a long time, this was a lot slower than the 32-bit glibc version, which just used the x87 instructions. However, improvements have (somewhat recently) been made, so depending on which glibc version you have, the situation might not be as bad as it used to be.
  • The Intel compiler suite is blessed with a VERY fast math library (libimf). Additionally, it includes vectorized transcendental math functions, which can often further speed up loops with these functions.
