64-bit executable runs slower than 32-bit version


Problem description

I have a 64-bit Ubuntu 13.04 system. I was curious to see how 32-bit applications perform against 64-bit applications on a 64-bit system, so I compiled the following C program as 32-bit and 64-bit executables and recorded the time each took to run. I used gcc flags to compile for three different targets:

  • -m32: Intel 80386 architecture (int, long, pointer all set to 32 bits (ILP32))
  • -m64: AMD's x86-64 architecture (int 32 bits; long, pointer 64 bits (LP64))
  • -mx32: AMD's x86-64 architecture (int, long, pointer all set to 32 bits (ILP32), but with the CPU in long mode, sixteen 64-bit registers, and the register-based calling convention)

// this program solves the
// project euler problem 16.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <sys/time.h>

int sumdigit(int a, int b);

int main(void) {
    int a = 2;
    int b = 10000;
    struct timeval start, finish;
    unsigned int i;
    gettimeofday(&start, NULL);
    for(i = 0; i < 1000; i++)
        (void)sumdigit(a, b);
    gettimeofday(&finish, NULL);
    printf("Did %u calls in %.4g seconds\n", 
            i, 
            finish.tv_sec - start.tv_sec + 1E-6 * (finish.tv_usec - start.tv_usec));
    return 0;
}

int sumdigit(int a, int b) {
    // numlen = number of digit in a^b
    // pcount = power of 'a' after ith iteration
    // dcount = number of digit in a^(pcount)

    int numlen = (int) (b * log10(a)) + 1;
    char *arr = calloc(numlen, sizeof *arr);
    int pcount = 0;
    int dcount = 1;
    arr[numlen - 1] = 1;
    int i, sum, carry;

    while(pcount < b) {
        pcount += 1;

        sum = 0; 
        carry = 0;

        for(i = numlen - 1; i >= numlen - dcount; --i) {
            sum = arr[i] * a + carry;
            carry = sum / 10;
            arr[i] = sum % 10;
        }

        while(carry > 0) {
            dcount += 1;
            sum = arr[numlen - dcount] + carry;
            carry = sum / 10;
            arr[numlen - dcount] = sum % 10;
        } 
    }

    int result = 0;
    for(i = numlen - dcount; i < numlen; ++i)
        result += arr[i];

    free(arr);
    return result;
}

The commands I used to build the different executables:

gcc -std=c99 -Wall -Wextra -Werror -pedantic -pedantic-errors pe16.c -o pe16_x32 -lm -mx32
gcc -std=c99 -Wall -Wextra -Werror -pedantic -pedantic-errors pe16.c -o pe16_32 -lm -m32
gcc -std=c99 -Wall -Wextra -Werror -pedantic -pedantic-errors pe16.c -o pe16_64 -lm

Here are the results I got:

ajay@ajay:c$ ./pe16_x32
Did 1000 calls in 89.19 seconds

ajay@ajay:c$ ./pe16_32
Did 1000 calls in 88.82 seconds

ajay@ajay:c$ ./pe16_64
Did 1000 calls in 92.05 seconds

Why does the 64-bit version run slower than the 32-bit one? I read that the 64-bit architecture has an improved instruction set and twice as many general-purpose registers as the 32-bit architecture, which allows for more optimizations. When can I expect better performance on a 64-bit system?

Edit: I turned on optimization using the -O3 flag, and now the results are:

ajay@ajay:c$ ./pe16_x32
Did 1000 calls in 38.07 seconds

ajay@ajay:c$ ./pe16_32
Did 1000 calls in 38.32 seconds

ajay@ajay:c$ ./pe16_64
Did 1000 calls in 38.27 seconds

Solution

Comparing performance of code without optimisations is rather pointless. If you care about performance, you'll only ever use optimised code.

When you enable optimisations, you find that the performance differences are negligible. That is to be expected. The operations you perform are all integer-based, using data of the same size in all cases. Since the 32-bit and 64-bit code run on the same integer hardware units, you should expect the same performance.

Your hot loops do not use any floating-point operations (the only one is the single log10 call per sumdigit invocation). Floating point is one area where 32-bit and 64-bit code sometimes differ, because they may use different floating-point hardware units (x64 uses SSE, x86 may use x87).

In short, the results are exactly as expected.
