Huge performance difference of a C++ program (compiled with GCC) under Mac and Linux

Problem description

Recently I wrote a small program in C++ (well, to be really honest it's more C plus classes) and tested the performance on both a Mac and Linux machine.

Even though the hardware is comparable, the performance is so different that I really think there is something strange going on.

First, some details:

Input: about 200MB compressed data

Operations of the program: it decompresses the data, then loads it into memory, and performs many data accesses to compute joins between the data. The program is sequential (no additional threads or processes).

Output: some strings to be displayed on the screen

The code is compiled with GCC 4.8.1 on the Linux machine and GCC 4.8.2 on the Mac machine. In both cases the compiler is invoked with the arguments:

gcc -c -O3 -fPIC -MD -MF $(patsubst %.o,%.d,$@) //The last three arguments are to create the dependencies between the files

The Mac machine (OS X Mavericks 10.9) is a MacBook Pro equipped with a 2.3 GHz Intel Core i7 (it's a quad-core), 256KB of L2 cache, 6MB of L3 cache, 8GB of DDR3 1600MHz, and a 256GB SSD disk.

The Linux machine (kernel 2.6.32-358) has an Intel E5-2620 at 2.0 GHz (it's a six-core), 16MB of cache, 64GB of DDR3 1600MHz, and a 256GB SSD disk. Both machines should use the Sandy Bridge architecture (maybe the Mac is Ivy Bridge, but anyway this shouldn't make a big difference).

Now, if I launch the program on the Linux machine it takes 217ms to finish, while if I launch it on the Mac machine it takes 132ms: this makes the Linux code 1.6 times slower!!

Now, I understand that the two machines have different OSes and hardware, but I find such a slowdown too large to be justified by these factors, and I feel that there must be some other reason behind it.

Notice that these timings were taken after all the data was loaded into memory, and I'm sure the program does not swap to disk during this time. Therefore, I can exclude the SSD disk as the problem.

Now, I really don't know what could have caused such a slowdown. The memory is basically equivalent, while the CPU is only a bit slower.

Could it be that GCC produced noticeably worse code on Linux than on the Mac?

Could it be that the Linux OS is noticeably worse than the Mac's?

I find both things hard to believe. Any help?

I realized that I didn't mention how I do the timings: well, I use the Boost.Chrono library, and I measure only the time necessary to invoke the main function. Something like:

time = now();
function();
duration = now() - time;
print(duration);
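
For reference, a minimal compilable sketch of that kind of measurement with Boost.Chrono could look like the following (function() is just a placeholder for the real work being timed, and the program is typically linked with -lboost_chrono -lboost_system):

#include <boost/chrono.hpp>
#include <iostream>

void function() { /* placeholder for the work being timed */ }

int main() {
    boost::chrono::steady_clock::time_point start = boost::chrono::steady_clock::now();
    function();
    boost::chrono::milliseconds duration =
        boost::chrono::duration_cast<boost::chrono::milliseconds>(
            boost::chrono::steady_clock::now() - start);
    std::cout << "Duration: " << duration.count() << " ms" << std::endl;
    return 0;
}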

After some tests, we managed to reproduce the performance difference with a much simpler (and silly) program:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

char in1[10000000];
char in2[10000000];

static inline uint64_t rdtscp (void) {
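    /* The ".byte 0x0f,0x01,0xf9" sequence below is the raw encoding of the RDTSCP
       instruction: it reads the time-stamp counter into EDX:EAX and the processor
       ID into ECX (captured in aux, which is unused here). */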
    uint64_t low, high;
    uint64_t aux;

    __asm__ __volatile__ (
                    ".byte 0x0f,0x01,0xf9"
                    : "=a" (low), "=d" (high), "=c" (aux)
                    );

    return low | (high << 32);
}

int main(int argc, char** argv) {

    uint64_t counter = rdtscp();

    for(int i = 0; i < 10000000; ++i) {
            in1[i] = (char)i * 200;
            in2[i] = (char)i * 100;
    }

    int joins = 0;
    for(int j = 0; j < 10000000; ++j) {
            int el = in1[j];
            for(int m = 0; m < 10000000; m++) {
                    if (in2[m] == el) {
                            joins++;
                            break;
                    }
            }
    }
    printf("Joins %d Cycles total %ld\n", joins, (rdtscp() - counter));

    return 0;
}

Please don't look at the operations of the program; they make little sense. What we tried to reproduce is a sequence of memory accesses and simple operations on them.

We launched this program on the Mac and the output was:

Joins 10000000 Cycles total 589015641

While on the linux machine it was:

Joins 10000000 Cycles total 838198832

Clearly the Linux version requires many more CPU cycles, which are probably needed to access the memory. Now the question is: why is the memory access slower?

One reason could be that in1 and in2 don't fit in the CPU caches (together they occupy about 20MB, more than either machine's last-level cache), and this requires some RAM accesses. As pointed out by Roy Longbottom, the memory in the Linux machine is indeed ECC, and this could be the reason behind the lower performance. If we combine this with the slightly lower CPU speed and the difference between Sandy Bridge and Ivy Bridge, then we probably have a good explanation for the difference.

Anyway, thanks all for the tips!

Answer

Both systems follow the System V AMD64 ABI, so gcc shouldn't make a difference there. Unfortunately, random effects in system performance are rather prevalent nowadays, so you can sometimes get significant performance differences through things as silly as re-ordering link order (cf. Mytkowicz et al., ``Producing wrong data without doing anything obviously wrong'' , http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.163.8395)

Here are some suggestions for how to analyse this that come to mind:

  1. Do more than one run. Personally I take at least 11 runs and compare the median (as well as the various quartiles, but that's probably more than you may care about there). This avoids some of the random effects. (See the sketch after this list.)
  2. Pipe all output into a file to minimise UI effects.
  3. Check your performance counters. On Linux you can use the `perf' tool. Check for `major-faults', which suggest that you have page faults that need to go to disk (unlikely on multiple runs, of course). Only then can you exclude that the disk makes a difference there. Unfortunately OS X doesn't (to the best of my knowledge) have as easy a way to collect performance counters.
  4. You can experiment with `-mcpu' to force the same target instruction set.
  5. Compare actual cache sizes. `dmidecode -t cache' does that on the Linux side, but you must be root. Your machines may have relevant differences there.
  6. If your program runs through multiple phases, try benchmarking them individually.
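
To make suggestion 1 concrete, here is a minimal sketch (not the answerer's code) of a median-of-runs measurement, again using Boost.Chrono since that is what the question already uses; function() stands in for the real work:

#include <boost/chrono.hpp>
#include <algorithm>
#include <iostream>
#include <vector>

void function() { /* placeholder for the work being measured */ }

int main() {
    using namespace boost::chrono;
    const int runs = 11;   // at least 11 runs, as suggested above
    std::vector<double> samples;
    for (int i = 0; i < runs; ++i) {
        steady_clock::time_point start = steady_clock::now();
        function();
        nanoseconds elapsed = duration_cast<nanoseconds>(steady_clock::now() - start);
        samples.push_back(elapsed.count() / 1e6);   // nanoseconds -> milliseconds
    }
    std::sort(samples.begin(), samples.end());
    std::cout << "Median over " << runs << " runs: " << samples[runs / 2] << " ms\n";
    return 0;
}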

Good luck!
