The program runs 3 times slower when compiled with g++ 5.3.1 than the same program compiled with g++ 4.8.4 with the same command

Problem Description

Recently, I've started using Ubuntu 16.04 with g++ 5.3.1 and noticed that my program runs 3 times slower. Before that I used Ubuntu 14.04 with g++ 4.8.4. I built it with the same flags: CFLAGS = -std=c++11 -Wall -O3.

My program contains loops filled with math calls (sin, cos, exp). You can find it here.

I've tried compiling with different optimization flags (O0, O1, O2, O3, Ofast), but the problem reproduces in all cases (with Ofast both variants run faster, but the first still runs 3 times slower).

In my program I use libtinyxml-dev and libgslcblas. They have the same versions in both cases and, according to the code and callgrind profiling, don't account for any significant part of the runtime.

I've performed profiling, but it doesn't give me any idea about why this happens. Kcachegrind comparison (left is slower). I've only noticed that the program now uses libm-2.23, compared to libm-2.19 on Ubuntu 14.04.
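
(As a quick check, assuming the binary is called ./program: ldd shows which libm a dynamically linked executable resolves to.)

ldd ./program | grep libm

On Ubuntu 16.04 this typically points at /lib/x86_64-linux-gnu/libm.so.6, which is a symlink to libm-2.23.so.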

My processor is an i7-5820 (Haswell).

I have no idea why it became slower. Do you have any ideas?

P.S. Below you can find the most time-consuming function:

void InclinedSum::prepare3D()
{
    double buf1, buf2;
    double sum_prev1 = 0.0, sum_prev2 = 0.0;
    int break_idx1, break_idx2;
    int arr_idx;

    for(int seg_idx = 0; seg_idx < props->K; seg_idx++)
    {
        const Point& r = well->segs[seg_idx].r_bhp;

        for(int k = 0; k < props->K; k++)
        {
            arr_idx = seg_idx * props->K + k;
            F[arr_idx] = 0.0;

            break_idx2 = 0;

            // Double series in m and l; each term requires several cos/tan calls
            for(int m = 1; m <= props->M; m++)
            {
                break_idx1 = 0;

                for(int l = 1; l <= props->L; l++)
                {
                    buf1 = ((cos(M_PI * (double)(m) * well->segs[k].r1.x / props->sizes.x - M_PI * (double)(l) * well->segs[k].r1.z / props->sizes.z) -
                                cos(M_PI * (double)(m) * well->segs[k].r2.x / props->sizes.x - M_PI * (double)(l) * well->segs[k].r2.z / props->sizes.z)) /
                            ( M_PI * (double)(m) * tan(props->alpha) / props->sizes.x + M_PI * (double)(l) / props->sizes.z ) +
                                (cos(M_PI * (double)(m) * well->segs[k].r1.x / props->sizes.x + M_PI * (double)(l) * well->segs[k].r1.z / props->sizes.z) -
                                cos(M_PI * (double)(m) * well->segs[k].r2.x / props->sizes.x + M_PI * (double)(l) * well->segs[k].r2.z / props->sizes.z)) /
                            ( M_PI * (double)(m) * tan(props->alpha) / props->sizes.x - M_PI * (double)(l) / props->sizes.z )
                                ) / 2.0;

                    buf2 = sqrt((double)(m) * (double)(m) / props->sizes.x / props->sizes.x + (double)(l) * (double)(l) / props->sizes.z / props->sizes.z);

                    // Sum over mirror images along y; exp, sin and cos dominate the cost
                    for(int i = -props->I; i <= props->I; i++)
                    {
                        F[arr_idx] += buf1 / well->segs[k].length / buf2 *
                            ( exp(-M_PI * buf2 * fabs(r.y - props->r1.y + 2.0 * (double)(i) * props->sizes.y)) -
                            exp(-M_PI * buf2 * fabs(r.y + props->r1.y + 2.0 * (double)(i) * props->sizes.y)) ) *
                            sin(M_PI * (double)(m) * r.x / props->sizes.x) *
                            cos(M_PI * (double)(l) * r.z / props->sizes.z);
                    }

                    // Truncate the l-series once two consecutive partial sums agree within tolerance
                    if( fabs(F[arr_idx] - sum_prev1) > F[arr_idx] * EQUALITY_TOLERANCE )
                    {
                        sum_prev1 = F[arr_idx];
                        break_idx1 = 0;
                    } else
                        break_idx1++;

                    if(break_idx1 > 1)
                    {
                        //std::cout << "l=" << l << std::endl;
                        break;
                    }
                }

                // Same early-exit criterion for the m-series
                if( fabs(F[arr_idx] - sum_prev2) > F[arr_idx] * EQUALITY_TOLERANCE )
                {
                    sum_prev2 = F[arr_idx];
                    break_idx2 = 0;
                } else
                    break_idx2++;

                if(break_idx2 > 1)
                {
                    std::cout << "m=" << m << std::endl;
                    break;
                }
            }
        }
    }
}

Further investigation. I wrote the following simple program:

#include <cmath>
#include <iostream>
#include <chrono>

#define CYCLE_NUM 1E+7

using namespace std;
using namespace std::chrono;

int main()
{
    double sum = 0.0;

    auto t1 = high_resolution_clock::now();
    // Start at i = 1 to avoid dividing by zero; sin() dominates the loop body
    for(int i = 1; i < CYCLE_NUM; i++)
    {
        sum += sin((double)(i)) / (double)(i);
    }
    auto t2 = high_resolution_clock::now();

    microseconds::rep t = duration_cast<microseconds>(t2-t1).count();

    cout << "sum = " << sum << endl;
    cout << "time = " << (double)(t) / 1.E+6 << endl;   // elapsed time in seconds

    return 0;
}

I am really wondering why this simple sample program runs 2.5 times faster under g++ 4.8.4 with libc-2.19 (libm-2.19) than under g++ 5.3.1 with libc-2.23 (libm-2.23).

The compile command is:

g++ -std=c++11 -O3 main.cpp -o sum

Using other optimization flags doesn't change the ratio.

How can I tell whether it is gcc or libc that slows down the program?
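
(One way to separate the two, as a sketch: assuming both compiler packages, e.g. g++-4.8 and g++-5 on Ubuntu, are installed side by side, both binaries then run against the same libc.)

g++-4.8 -std=c++11 -O3 main.cpp -o sum48
g++-5 -std=c++11 -O3 main.cpp -o sum53
time ./sum48
time ./sum53

If both binaries are similarly slow on the newer system, the shared libc/libm is implicated rather than the compiler.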

Answer

This is a bug in glibc that affects version 2.23 (used in Ubuntu 16.04) and early versions of 2.24 (Fedora and Debian, for example, already ship patched versions that are no longer affected; Ubuntu 16.10 and 17.04 do not yet include the fix).
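
(The installed glibc version can be checked with, for example:)

ldd --version | head -n1

On an affected Ubuntu 16.04 system this prints something like "ldd (Ubuntu GLIBC 2.23-0ubuntu3) 2.23".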

The slowdown stems from the SSE to AVX register transition penalty. Roughly speaking, on affected versions the dynamic linker's lazy-binding trampoline (_dl_runtime_resolve_avx) saves and restores the argument registers using 256-bit AVX instructions; this leaves the upper halves of the vector registers in a dirty state, so the SSE code inside libm then pays transition penalties. See the glibc bug report here: https://sourceware.org/bugzilla/show_bug.cgi?id=20495

Oleg Strikov wrote up a quite extensive analysis in his Ubuntu bug report: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280

Without the patch, there are various possible workarounds: you can compile your program statically (i.e. add -static), or you can disable lazy binding by setting the environment variable LD_BIND_NOW during the program's execution. Again, there are more details in the above bug reports.
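
(For example, using the test program from the question; both variants bypass the lazy-binding code path that triggers the penalty:)

g++ -std=c++11 -O3 -static main.cpp -o sum
LD_BIND_NOW=1 ./sum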
