困难来衡量C / C ++的性能 [英] Difficulties to measure C/C++ performance

查看:111
本文介绍了困难来衡量C / C ++的性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我写了件C code,展现在一个关于优化和分支prediction讨论的焦点。这时我才发现更多样化的结果比我期望的那样。我的目标是把它写在为C ++和C之间的共同交集的语言,这是符合标准的两种语言,而且是相当便携。它在不同的Windows电脑进行测试:

I wrote a piece of C code to show a point in a discussion about optimizations and branch prediction. Then I noticed even more diverse outcome than I did expect. My goal was to write it in a language that is common subset between C++ and C, that is standard-compliant for both languages and that is fairly portable. It was tested on different Windows PCs:

#include <stdio.h>
#include <time.h>

/// @return - time difference between start and stop in milliseconds
int ms_elapsed( clock_t start, clock_t stop )
{
    return (int)( 1000.0 * ( stop - start ) / CLOCKS_PER_SEC );
}

int const Billion = 1000000000;
/// & with numbers up to Billion gives 0, 0, 2, 2 repeating pattern 
int const Pattern_0_0_2_2 = 0x40000002; 

/// @return - half of Billion  
int unpredictableIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // true, true, false, false ...
        if ( ( i & Pattern_0_0_2_2 ) == 0 )
        {
            ++sum;
        }
    }
    return sum;
}

/// @return - half of Billion  
int noIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // 1, 1, 0, 0 ...
        sum += ( i & Pattern_0_0_2_2 ) == 0;
    }
    return sum;
}

int main()
{
    clock_t volatile start;
    clock_t volatile stop;
    int volatile sum;
    printf( "Puzzling measurements:\n" );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = noIfs();
    stop = clock();
    printf( "Same without ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );
}

编译时VS2010; / O2优化的英特尔酷睿2,WinXP的结果:

Compiled with VS2010; /O2 optimizations Intel Core 2, WinXP results:

Puzzling measurements:
Unpredictable ifs took 1344 msec; answer was 500000000
Unpredictable ifs took 1016 msec; answer was 500000000
Same without ifs took 1031 msec; answer was 500000000
Unpredictable ifs took 4797 msec; answer was 500000000

编辑:编译全部开关:

/紫/ NOLOGO / W3 / WX-/ O2 /爱/ Oy- / GL / DWIN32/ DNDEBUG/ D_CONSOLE/ D_UNI code/ DUNI code/ GM-/ EHSC / GS /戈瑞/ FP:precise / ZC:wchar_t的/ ZC:forScope /Fp\"Release\\Trying.pch/法发布\\/ FO发布\\/ FD发布\\ vc100.pdb/ GD / analyze- / errorReport:队列

/Zi /nologo /W3 /WX- /O2 /Oi /Oy- /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\Trying.pch" /Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue

其他人发布这样的......编译使用MinGW,G ++ 4.71,-O1优化英特尔酷睿2,WinXP的结果:

Other person posted such ... Compiled with MinGW, g++ 4.71, -O1 optimizations Intel Core 2, WinXP results:

Puzzling measurements:
Unpredictable ifs took 1656 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000
Same without ifs took 1969 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000

此外,他公布这样的结果-O3优化:

Also he posted such results for -O3 optimizations:

Puzzling measurements:
Unpredictable ifs took 1890 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000
Same without ifs took 1422 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000

现在我有问题。这到底是怎么回事呢?

Now I have question. What is going on here?

更具体地说......试问一个固定功能用不了这么不同的时间?有什么不对我的code?是不是有什么猫腻采用英特尔处理器?是编译器做一些奇怪的?这可能是因为32位code运行在64位处理器?

More specifically ... How can a fixed function take so different amounts of time? Is there something wrong in my code? Is there something tricky with Intel processor? Are the compilers doing something odd? Can it be because of 32 bit code ran on 64 bit processor?

感谢您的关注!

编辑:
我接受其他2调用,G ++ -O1只是将回收价值。我也承认,G ++ -O2和g ++ -O3有缺陷,即离开了优化。测量速度显著多样性(450%!),似乎依然神秘。

I accept that g++ -O1 just reuses returned values in 2 other calls. I also accept that g++ -O2 and g++ -O3 have defect that leaves the optimization out. Significant diversity of measured speeds (450% !!!) seems still mysterious.

我看着由VS2010产生code的拆卸。它没有内联未predictableIfs 3倍。内联code是相当相似;环路是一样的。它没有内联 noIfs 。它没有卷 noIfs 了一下。这需要在一个迭代4个步骤。 noIfs 计算就像是写而未predictableIfs 使用 JNE 跳过增量。

I looked at disassembly of code produced by VS2010. It did inline unpredictableIfs 3 times. The inlined code was fairly similar; the loop was same. It did not inline noIfs. It did roll noIfs out a bit. It takes 4 steps in one iteration. noIfs calculate like was written while unpredictableIfs use jne to jump over increment.

推荐答案

通过 -O1 ,GCC-4.7.1电话未predictableIfs 只有一次,resuses的结果,因为它认识到,这是一个纯粹的功能,所以结果将是相同的每个,它被称为时间。 (矿一样,通过查看生成的程序集验证。)

With -O1, gcc-4.7.1 calls unpredictableIfs only once and resuses the result, since it recognizes that it's a pure function, so the result will be the same every time it's called. (Mine did, verified by looking at the generated assembly.)

使用更高优化级别,内联函数,编译器不能识别它是相同的code了,所以它运行的每个函数调用出现在源的时间。

With higher optimisation level, the functions are inlined, and the compiler doesn't recognize that it's the same code anymore, so it is run each time a function call appears in the source.

除此之外,我的gcc-4.7.1交易最好未predictableIfs使用时 -O1 -O2 (除了再利用问题,既产生相同的code),而 noIfs 是治疗 -O3 更好。同样的code的不同运行之间的时序是但一致的位置 - 相同或相差10毫秒(时钟的粒度),所以我不知道有什么事情造成实质上不同的时间未predictableIfs 您报道了 -O3

Apart from that, my gcc-4.7.1 deals best with unpredictableIfs when using -O1 or -O2 (apart from the reuse issue, both produce the same code), while noIfs is treated much better with -O3. The timings between the different runs of the same code are however consistent here - equal or differing by 10 milliseconds (granularity of clock), so I have no idea what could cause the substantially different times for unpredictableIfs you reported for -O3.

使用 -O2 ,循环为未predictableIfs 与产生的code与 -O1 (除寄存器交换):

With -O2, the loop for unpredictableIfs is identical to the code generated with -O1 (except for register swapping):

.L12:
    movl    %eax, %ecx
    andl    $1073741826, %ecx
    cmpl    $1, %ecx
    adcl    $0, %edx
    addl    $1, %eax
    cmpl    $1000000000, %eax
    jne .L12

noIfs 这是类似的:

.L15:
    xorl    %ecx, %ecx
    testl   $1073741826, %eax
    sete    %cl
    addl    $1, %eax
    addl    %ecx, %edx
    cmpl    $1000000000, %eax
    jne .L15

它在哪里。

.L7:
    testl   $1073741826, %edx
    sete    %cl
    movzbl  %cl, %ecx
    addl    %ecx, %eax
    addl    $1, %edx
    cmpl    $1000000000, %edx
    jne .L7

-O1 。两个循环在相似的时间来看,随着未predictableIfs 快一点。

with -O1. Both loops run in similar time, with unpredictableIfs a bit faster.

使用 -O3 ,循环为未predictableIfs 变差,

.L14:
    leal    1(%rdx), %ecx
    testl   $1073741826, %eax
    cmove   %ecx, %edx
    addl    $1, %eax
    cmpl    $1000000000, %eax
    jne     .L14

noIfs (包括设置 - code在这里),它变得更好:

and for noIfs (including the setup-code here), it becomes better:

    pxor    %xmm2, %xmm2
    movq    %rax, 32(%rsp)
    movdqa  .LC3(%rip), %xmm6
    xorl    %eax, %eax
    movdqa  .LC2(%rip), %xmm1
    movdqa  %xmm2, %xmm3
    movdqa  .LC4(%rip), %xmm5
    movdqa  .LC5(%rip), %xmm4
    .p2align 4,,10
    .p2align 3
.L18:
    movdqa  %xmm1, %xmm0
    addl    $1, %eax
    paddd   %xmm6, %xmm1
    cmpl    $250000000, %eax
    pand    %xmm5, %xmm0
    pcmpeqd %xmm3, %xmm0
    pand    %xmm4, %xmm0
    paddd   %xmm0, %xmm2
    jne .L18

.LC2:
    .long   0
    .long   1
    .long   2
    .long   3
    .align 16
.LC3:
    .long   4
    .long   4
    .long   4
    .long   4
    .align 16
.LC4:
    .long   1073741826
    .long   1073741826
    .long   1073741826
    .long   1073741826
    .align 16
.LC5:
    .long   1
    .long   1
    .long   1
    .long   1

它计算四次迭代一次,因此, noIfs 运行几乎快四倍呢。

这篇关于困难来衡量C / C ++的性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆