What is the performance impact of using int64_t instead of int32_t on 32-bit systems?

Problem description

    Our C++ library currently uses time_t for storing time values. I'm beginning to need sub-second precision in some places, so a larger data type will be necessary there anyway. Also, it might be useful to get around the Year-2038 problem in some places. So I'm thinking about completely switching to a single Time class with an underlying int64_t value, to replace the time_t value in all places.

    Now I'm wondering about the performance impact of such a change when running this code on a 32-bit operating system or 32-bit CPU. IIUC the compiler will generate code to perform 64-bit arithmetic using 32-bit registers. But if this is too slow, I might have to use a more differentiated way for dealing with time values, which might make the software more difficult to maintain.

    What I'm interested in:

    • which factors influence performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well? Will a normal 32-bit system use the 64-bit registers of modern CPUs?
    • which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?
    • are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?
    • does anyone have their own experience with this performance impact?

    I'm mostly interested in g++ 4.1 and 4.4 on Linux 2.6 (RHEL5, RHEL6) on Intel Core 2 systems; but it would also be nice to know about the situation for other systems (like Sparc Solaris + Solaris CC, Windows + MSVC).
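
For illustration, the kind of class described above might look roughly like the sketch below. It is only an assumption about the design: the class name, the microsecond resolution, and every member name are hypothetical and not taken from the actual library. It mainly shows which 64-bit operations such a class would perform: additions, subtractions and comparisons on every use, and a division whenever whole seconds are extracted.

```cpp
#include <cstdint>
#include <ctime>

// Hypothetical Time class: microseconds since the Unix epoch in an int64_t.
// Resolution and names are assumptions made for this sketch.
class Time {
public:
    static const int64_t kMicrosPerSecond = 1000000;

    Time() : micros_(0) {}
    explicit Time(int64_t micros) : micros_(micros) {}

    // Widen before multiplying so a 32-bit time_t cannot overflow.
    static Time from_time_t(std::time_t seconds)
    {
        return Time(static_cast<int64_t>(seconds) * kMicrosPerSecond);
    }

    // Whole seconds, truncated toward zero. This is a 64-bit division.
    int64_t seconds() const { return micros_ / kMicrosPerSecond; }
    int64_t micros() const { return micros_; }

    // Addition, subtraction and comparison are the common operations.
    Time operator+(Time other) const { return Time(micros_ + other.micros_); }
    Time operator-(Time other) const { return Time(micros_ - other.micros_); }
    bool operator<(Time other) const { return micros_ < other.micros_; }
    bool operator==(Time other) const { return micros_ == other.micros_; }

private:
    int64_t micros_;
};
```

With a layout like this, `Time::from_time_t(t)` converts existing `time_t` values, and only `seconds()` needs a 64-bit divide.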

    Solution

    which factors influence performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well?

    Mostly the processor architecture (and model; wherever I mention processor architecture in this section, read it as including the model). The compiler may have some influence, but most compilers do pretty well on this, so the processor architecture will have a bigger influence than the compiler.

    The operating system will have no influence whatsoever (other than "if you change OS, you need to use a different type of compiler which changes what the compiler does" in some cases - but that's probably a small effect).

    Will a normal 32-bit system use the 64-bit registers of modern CPUs?

    This is not possible. If the system is in 32-bit mode, it will act as a 32-bit system. The extra 32 bits of the registers are completely invisible, just as they would be if the system was actually a "true 32-bit system".

    which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?

    Addition and subtraction are worse, as each has to be done as a sequence of two operations, and the second operation requires the first to have completed. This is not the case if the compiler is just producing two add operations on independent data.
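
The dependency between the two halves can be sketched in plain C++. This is an illustration of the idea, not the code a compiler emits; `U64Pair`, `add64`, `split` and `join` are names invented for this sketch.

```cpp
#include <cstdint>

// A 64-bit value held as two 32-bit halves, the way a 32-bit target
// keeps it in a register pair. Names are illustrative only.
struct U64Pair {
    uint32_t lo;
    uint32_t hi;
};

// Add two 64-bit values using only 32-bit arithmetic.
U64Pair add64(U64Pair a, U64Pair b)
{
    U64Pair r;
    r.lo = a.lo + b.lo;               // first add; wraps modulo 2^32
    uint32_t carry = (r.lo < a.lo);   // wrapped if the sum is smaller than an input
    r.hi = a.hi + b.hi + carry;       // second add must wait for the carry
    return r;
}

// Helpers for converting to and from a native 64-bit integer.
U64Pair split(uint64_t v)
{
    return { static_cast<uint32_t>(v), static_cast<uint32_t>(v >> 32) };
}

uint64_t join(U64Pair p)
{
    return (static_cast<uint64_t>(p.hi) << 32) | p.lo;
}
```

On x86 the same pattern is typically an add followed by an add-with-carry, so the second instruction cannot start until the first has produced its carry flag.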

    Multiplication will get a lot worse if the input parameters are actually 64 bits wide, so 2^35 * 83 is worse than 2^31 * 2^31, for example. This is because the processor can do a 32 x 32 bit multiply into a 64-bit result pretty well, in some 5-10 clock cycles. But a 64 x 64 bit multiply requires a fair bit of extra code, so it will take longer.
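
As a rough sketch of that extra code, a 64 x 64 bit multiply (keeping the low 64 bits of the result) can be built from 32-bit pieces as shown below. The function name `mul64_from_32` is made up for this example, and a real compiler may arrange the work differently.

```cpp
#include <cstdint>

// Multiply two 64-bit values, keeping the low 64 bits of the product,
// using one 32 x 32 -> 64 multiply and two 32 x 32 -> 32 multiplies.
uint64_t mul64_from_32(uint64_t a, uint64_t b)
{
    const uint32_t a_lo = static_cast<uint32_t>(a);
    const uint32_t a_hi = static_cast<uint32_t>(a >> 32);
    const uint32_t b_lo = static_cast<uint32_t>(b);
    const uint32_t b_hi = static_cast<uint32_t>(b >> 32);

    // Full 64-bit product of the low halves.
    const uint64_t low = static_cast<uint64_t>(a_lo) * b_lo;

    // The cross terms only affect the upper 32 bits of the result, so they
    // are computed modulo 2^32. a_hi * b_hi lies entirely above bit 63 and
    // is dropped.
    const uint32_t cross = a_lo * b_hi + a_hi * b_lo;

    return low + (static_cast<uint64_t>(cross) << 32);
}
```

When both high halves are zero the cross terms vanish, which is why a multiply whose inputs fit in 32 bits is so much cheaper.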

    Division is a similar problem to multiplication - but here it's OK to take a 64-bit input on the one side, divide it by a 32-bit value and get a 32-bit value out. Since it's hard to predict when this will work, the 64-bit divide is probably nearly always slow.
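
To make the "64-bit input, 32-bit divisor" case concrete, here is a sketch assuming an unsigned divisor that fits in 32 bits. The names `quotient_fits_32` and `div64_by_32` are invented for this example. The general 64-by-64 case is more involved and is usually handled by a runtime helper routine.

```cpp
#include <cstdint>

// True when a 64-by-32-bit divide yields a quotient that fits in 32 bits.
// That is exactly when the upper half of the dividend is smaller than the
// divisor.
bool quotient_fits_32(uint64_t n, uint32_t d)
{
    return d != 0 && static_cast<uint32_t>(n >> 32) < d;
}

// Divide a 64-bit value by a 32-bit divisor in two steps, each with a
// quotient that fits in 32 bits (schoolbook long division in base 2^32).
// d must be non-zero.
uint64_t div64_by_32(uint64_t n, uint32_t d)
{
    const uint32_t hi = static_cast<uint32_t>(n >> 32);
    const uint32_t lo = static_cast<uint32_t>(n);

    const uint32_t q_hi = hi / d;   // upper quotient digit
    const uint32_t r    = hi % d;   // remainder carried into the next step, r < d

    // (r : lo) / d has a 32-bit quotient because r < d.
    const uint64_t rest = (static_cast<uint64_t>(r) << 32) | lo;
    const uint32_t q_lo = static_cast<uint32_t>(rest / d);

    return (static_cast<uint64_t>(q_hi) << 32) | q_lo;
}
```

The compiler generally cannot prove at compile time that `quotient_fits_32` holds, which is why it falls back to the slow general path.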

    The data will also take twice as much cache space, which may impact the results. As a similar consequence, general assignment and passing data around will take at least twice as long, since there is twice as much data to operate on.

    The compiler will also need to use more registers.

    are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?

    Probably, but I'm not aware of any. And even if there are, it would only be somewhat meaningful to you, since the mix of operations is HIGHLY critical to the speed of operations.

    If performance is an important part of your application, then benchmark YOUR code (or some representative part of it). It doesn't really matter if Benchmark X gives 5%, 25% or 103% slower results, if your code is some completely different amount slower or faster under the same circumstances.
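
A minimal, portable timing harness for that kind of measurement might look like the sketch below. It uses `std::chrono::steady_clock` rather than a cycle counter, and `best_time_ns` and `sum_values` are hypothetical names. Replace the example workload with a representative piece of your own code.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>
#include <vector>

// Run fn() `repeats` times and return the fastest run in nanoseconds.
// Taking the minimum filters out runs disturbed by interrupts or cold caches.
template <typename Fn>
int64_t best_time_ns(Fn fn, int repeats)
{
    typedef std::chrono::steady_clock clock;
    int64_t best = std::numeric_limits<int64_t>::max();
    for (int i = 0; i < repeats; ++i) {
        const clock::time_point start = clock::now();
        fn();
        const clock::time_point stop = clock::now();
        const int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (ns < best)
            best = ns;
    }
    return best;
}

// Example workload: sum a vector of T. The volatile sink keeps the
// compiler from discarding the computation.
template <typename T>
T sum_values(const std::vector<T>& v)
{
    T sum = 0;
    for (T x : v)
        sum += x;
    volatile T sink = sum;
    (void)sink;
    return sum;
}
```

Calling `best_time_ns([&]{ sum_values(v64); }, 20)` for a `std::vector<uint64_t>` and again for a `std::vector<uint32_t>` with the same contents gives a direct comparison on the target machine and compiler.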

    does anyone have their own experience with this performance impact?

    I've recompiled some code that uses 64-bit integers for 64-bit architecture, and found the performance improve by some substantial amount - as much as 25% on some bits of code.

    Perhaps changing your OS to a 64-bit version of the same OS would help?

    Edit:

    Because I like to find out what the difference is in this sort of thing, I have written a bit of code with some primitive templates (still learning that bit; templates aren't exactly my hottest topic, I must say. Give me bit-fiddling and pointer arithmetic, and I'll (usually) get it right...)

    Here's the code I wrote, trying to replicate a few common functions:

    #include <iostream>
    #include <cstdint>
    #include <ctime>
    
    using namespace std;
    
    static __inline__ uint64_t rdtsc(void)
    {
        unsigned hi, lo;
        __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
        return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
    }
    
    // Sum a flat array of `size` elements.
    template<typename T>
    static T add_numbers(const T *v, const int size)
    {
        T sum = 0;
        for(int i = 0; i < size; i++)
            sum += v[i];
        return sum;
    }
    
    
    // Sum each row of a size x size matrix, then sum the row totals.
    template<typename T, const int size>
    static T add_matrix(const T v[size][size])
    {
        T sum[size] = {};
        for(int i = 0; i < size; i++)
        {
            for(int j = 0; j < size; j++)
                sum[i] += v[i][j];
        }
        T tsum = 0;
        for(int i = 0; i < size; i++)
            tsum += sum[i];
        return tsum;
    }
    
    
    
    // Sum of v[i] * mul.
    template<typename T>
    static T add_mul_numbers(const T *v, const T mul, const int size)
    {
        T sum = 0;
        for(int i = 0; i < size; i++)
            sum += v[i] * mul;
        return sum;
    }
    
    // Sum of v[i] / divisor.
    template<typename T>
    static T add_div_numbers(const T *v, const T divisor, const int size)
    {
        T sum = 0;
        for(int i = 0; i < size; i++)
            sum += v[i] / divisor;
        return sum;
    }
    
    
    // Fill a flat array with 0, 1, 2, ...
    template<typename T> 
    void fill_array(T *v, const int size)
    {
        for(int i = 0; i < size; i++)
            v[i] = i;
    }
    
    // Fill a size x size matrix with distinct values.
    template<typename T, const int size> 
    void fill_array(T v[size][size])
    {
        for(int i = 0; i < size; i++)
            for(int j = 0; j < size; j++)
                v[i][j] = i + size * j;
    }
    
    
    
    
    uint32_t bench_add_numbers(const uint32_t v[], const int size)
    {
        uint32_t res = add_numbers(v, size);
        return res;
    }
    
    uint64_t bench_add_numbers(const uint64_t v[], const int size)
    {
        uint64_t res = add_numbers(v, size);
        return res;
    }
    
    uint32_t bench_add_mul_numbers(const uint32_t v[], const int size)
    {
        const uint32_t c = 7;
        uint32_t res = add_mul_numbers(v, c, size);
        return res;
    }
    
    uint64_t bench_add_mul_numbers(const uint64_t v[], const int size)
    {
        const uint64_t c = 7;
        uint64_t res = add_mul_numbers(v, c, size);
        return res;
    }
    
    uint32_t bench_add_div_numbers(const uint32_t v[], const int size)
    {
        const uint32_t c = 7;
        uint32_t res = add_div_numbers(v, c, size);
        return res;
    }
    
    uint64_t bench_add_div_numbers(const uint64_t v[], const int size)
    {
        const uint64_t c = 7;
        uint64_t res = add_div_numbers(v, c, size);
        return res;
    }
    
    
    template<const int size>
    uint32_t bench_matrix(const uint32_t v[size][size])
    {
        uint32_t res = add_matrix(v);
        return res;
    }
    template<const int size>
    uint64_t bench_matrix(const uint64_t v[size][size])
    {
        uint64_t res = add_matrix(v);
        return res;
    }
    
    
    template<typename T>
    void runbench(T (*func)(const T *v, const int size), const char *name, T *v, const int size)
    {
        fill_array(v, size);
    
        uint64_t t = rdtsc();
        T res = func(v, size);
        t = rdtsc() - t;
        cout << "result = " << res << endl;
        cout << name << " time in clocks " << dec << t  << endl;
    }
    
    template<typename T, const int size>
    void runbench2(T (*func)(const T v[size][size]), const char *name, T v[size][size])
    {
        fill_array(v);
    
        uint64_t t = rdtsc();
        T res = func(v);
        t = rdtsc() - t;
        cout << "result = " << res << endl;
        cout << name << " time in clocks " << dec << t  << endl;
    }
    
    
    int main()
    {
        // spin up CPU to full speed...
        time_t t = time(NULL);
        while(t == time(NULL)) ;
    
        const int vsize=10000;
    
        uint32_t v32[vsize];
        uint64_t v64[vsize];
    
        uint32_t m32[100][100];
        uint64_t m64[100][100];
    
    
        runbench(bench_add_numbers, "Add 32", v32, vsize);
        runbench(bench_add_numbers, "Add 64", v64, vsize);
    
        runbench(bench_add_mul_numbers, "Add Mul 32", v32, vsize);
        runbench(bench_add_mul_numbers, "Add Mul 64", v64, vsize);
    
        runbench(bench_add_div_numbers, "Add Div 32", v32, vsize);
        runbench(bench_add_div_numbers, "Add Div 64", v64, vsize);
    
        runbench2(bench_matrix, "Matrix 32", m32);
        runbench2(bench_matrix, "Matrix 64", m64);
    }
    

    Compiled with:

    g++ -Wall -m32 -O3 -o 32vs64 32vs64.cpp -std=c++0x
    

    And the results are as follows. Note: see the 2016 results further down. The results here are slightly optimistic, because SSE instructions are used in 64-bit mode but not in 32-bit mode.

    result = 49995000
    Add 32 time in clocks 20784
    result = 49995000
    Add 64 time in clocks 30358
    result = 349965000
    Add Mul 32 time in clocks 30182
    result = 349965000
    Add Mul 64 time in clocks 79081
    result = 7137858
    Add Div 32 time in clocks 60167
    result = 7137858
    Add Div 64 time in clocks 457116
    result = 49995000
    Matrix 32 time in clocks 22831
    result = 49995000
    Matrix 64 time in clocks 23823
    

    As you can see, addition (about 1.5x slower) and multiplication (about 2.6x slower) aren't that much worse. Division gets really bad (about 7.6x slower). Interestingly, the matrix addition shows hardly any difference at all.

    "And is it faster on 64-bit?", I hear some of you ask. Using the same compiler options, just -m64 instead of -m32: yup, a lot faster:

    result = 49995000
    Add 32 time in clocks 8366
    result = 49995000
    Add 64 time in clocks 16188
    result = 349965000
    Add Mul 32 time in clocks 15943
    result = 349965000
    Add Mul 64 time in clocks 35828
    result = 7137858
    Add Div 32 time in clocks 50176
    result = 7137858
    Add Div 64 time in clocks 50472
    result = 49995000
    Matrix 32 time in clocks 12294
    result = 49995000
    Matrix 64 time in clocks 14733
    

    Edit, update for 2016: four variants, with and without SSE, in 32- and 64-bit mode of the compiler.

    I typically use clang++ as my compiler these days. I also tried compiling with g++ (though it would be a different version than above anyway, as I've updated my machine, and I have a different CPU too). Since g++ failed to compile the no-sse version in 64-bit mode, I didn't see the point in listing its results. (g++ gives similar results anyway.)

    As a short table:

    Test name      | no-sse 32 | no-sse 64 | sse 32 | sse 64 |
    ----------------------------------------------------------
    Add uint32_t   |   20837   |   10221   |   3701 |   3017 |
    ----------------------------------------------------------
    Add uint64_t   |   18633   |   11270   |   9328 |   9180 |
    ----------------------------------------------------------
    Add Mul 32     |   26785   |   18342   |  11510 |  11562 |
    ----------------------------------------------------------
    Add Mul 64     |   44701   |   17693   |  29213 |  16159 |
    ----------------------------------------------------------
    Add Div 32     |   44570   |   47695   |  17713 |  17523 |
    ----------------------------------------------------------
    Add Div 64     |  405258   |   52875   | 405150 |  47043 |
    ----------------------------------------------------------
    Matrix 32      |   41470   |   15811   |  21542 |   8622 |
    ----------------------------------------------------------
    Matrix 64      |   22184   |   15168   |  13757 |  12448 |
    

    Full results, with compile options:

    $ clang++ -m32 -mno-sse 32vs64.cpp --std=c++11 -O2
    $ ./a.out
    result = 49995000
    Add 32 time in clocks 20837
    result = 49995000
    Add 64 time in clocks 18633
    result = 349965000
    Add Mul 32 time in clocks 26785
    result = 349965000
    Add Mul 64 time in clocks 44701
    result = 7137858
    Add Div 32 time in clocks 44570
    result = 7137858
    Add Div 64 time in clocks 405258
    result = 49995000
    Matrix 32 time in clocks 41470
    result = 49995000
    Matrix 64 time in clocks 22184
    
    $ clang++ -m32 -msse 32vs64.cpp --std=c++11 -O2
    $ ./a.out
    result = 49995000
    Add 32 time in clocks 3701
    result = 49995000
    Add 64 time in clocks 9328
    result = 349965000
    Add Mul 32 time in clocks 11510
    result = 349965000
    Add Mul 64 time in clocks 29213
    result = 7137858
    Add Div 32 time in clocks 17713
    result = 7137858
    Add Div 64 time in clocks 405150
    result = 49995000
    Matrix 32 time in clocks 21542
    result = 49995000
    Matrix 64 time in clocks 13757
    
    
    $ clang++ -m64 -msse 32vs64.cpp --std=c++11 -O2
    $ ./a.out
    result = 49995000
    Add 32 time in clocks 3017
    result = 49995000
    Add 64 time in clocks 9180
    result = 349965000
    Add Mul 32 time in clocks 11562
    result = 349965000
    Add Mul 64 time in clocks 16159
    result = 7137858
    Add Div 32 time in clocks 17523
    result = 7137858
    Add Div 64 time in clocks 47043
    result = 49995000
    Matrix 32 time in clocks 8622
    result = 49995000
    Matrix 64 time in clocks 12448
    
    
    $ clang++ -m64 -mno-sse 32vs64.cpp --std=c++11 -O2
    $ ./a.out
    result = 49995000
    Add 32 time in clocks 10221
    result = 49995000
    Add 64 time in clocks 11270
    result = 349965000
    Add Mul 32 time in clocks 18342
    result = 349965000
    Add Mul 64 time in clocks 17693
    result = 7137858
    Add Div 32 time in clocks 47695
    result = 7137858
    Add Div 64 time in clocks 52875
    result = 49995000
    Matrix 32 time in clocks 15811
    result = 49995000
    Matrix 64 time in clocks 15168
    
