Effectiveness of GCC optimization on bit operations


Problem Description

Here are two ways to set an individual bit in C on x86-64:

inline void SetBitC(long *array, int bit) {
   //Pure C version
   *array |= 1<<bit;
}

inline void SetBitASM(long *array, int bit) {
   // Using inline x86 assembly
   asm("bts %1,%0" : "+r" (*array) : "g" (bit));
}

Using GCC 4.3 with -O3 -march=core2 options, the C version takes about 90% more time when used with a constant bit. (Both versions compile to exactly the same assembly code, except that the C version uses an or [1<<num],%rax instruction instead of a bts [num],%rax instruction)
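
To reproduce the comparison, here is a minimal, self-contained harness (the file layout and the names with_or/with_bts are mine, not from the original post). Compiling it with gcc -O3 -march=core2 -S and reading the generated assembly shows which instruction the compiler picked for the constant-bit case. Note one adjustment of mine: the asm constraint is tightened from "g" (bit) to "Jr" ((long)bit) so the statement assembles even when the bit is not a compile-time constant.

/* bit_compare.c - hypothetical reproduction harness (not from the original post).
   Build:  gcc -O3 -march=core2 -S bit_compare.c
   Then look for an "or" with an immediate mask vs. a "bts" in bit_compare.s. */
#include <stdio.h>

static inline void SetBitC(long *array, int bit) {
    // Pure C version, as in the question
    *array |= 1 << bit;
}

static inline void SetBitASM(long *array, int bit) {
    // Inline x86 asm; "Jr" accepts an immediate 0..63 or a 64-bit register
    asm("bts %1,%0" : "+r" (*array) : "Jr" ((long)bit));
}

/* With the bit known at compile time, both helpers inline and the asm operand
   becomes an immediate, so with_bts compiles to "bts $10,%reg" while with_or
   typically compiles to "or $1024,%reg". */
long with_or(long x)  { SetBitC(&x, 10);   return x; }
long with_bts(long x) { SetBitASM(&x, 10); return x; }

int main(void) {
    printf("%ld %ld\n", with_or(0), with_bts(0));   // both print 1024
    return 0;
}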

When used with a variable bit, the C version performs better but is still significantly slower than the inline assembly.

Resetting, toggling and checking bits have similar results.
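
The question does not show those variants, but for context here is a sketch of what the analogous pure-C operations presumably look like, together with the x86 bit instructions they correspond to (btr clears, btc toggles, bt tests a bit). These helpers are illustrative only; the 1L constant is used so the mask is as wide as the long being modified.

#include <stdio.h>

// Illustrative counterparts to SetBitC; not taken from the original post.
static inline void ClearBitC(long *array, int bit)  { *array &= ~(1L << bit); }      // x86: btr
static inline void ToggleBitC(long *array, int bit) { *array ^=  (1L << bit); }      // x86: btc
static inline int  TestBitC(long *array, int bit)   { return (*array >> bit) & 1; }  // x86: bt

int main(void) {
    long x = 0;
    ToggleBitC(&x, 10);                  // x == 1024
    printf("%d\n", TestBitC(&x, 10));    // prints 1
    ClearBitC(&x, 10);
    printf("%ld\n", x);                  // prints 0
    return 0;
}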

Why does GCC optimize so poorly for such a common operation? Am I doing something wrong with the C version?

Edit: Sorry for the long wait; here is the code I used to benchmark. It actually started as a simple programming problem...

int main() {
    // Get the sum of all integers from 1 to 2^28 with bit 11 always set
    unsigned long i,j,c=0;
    for (i=1; i<(1<<28); i++) {
        j = i;
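        // SetBit stands for whichever variant is being timed: SetBitC or SetBitASM from above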
        SetBit(&j, 10);
        c += j;
    }
    printf("Result: %lu\n", c);
    return 0;
}

gcc -O3 -march=core2 -pg test.c
./a.out
gprof
with ASM: 101.12      0.08     0.08                             main
with C:   101.12      0.16     0.16                             main

time ./a.out also gives similar results.
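
For a measurement that does not depend on gprof's instrumentation, a hypothetical standalone timing harness along the same lines (names and structure are mine; it times the pure-C variant, and SetBitASM from the question can be swapped in to time the other case), built the same way, e.g. gcc -O3 -march=core2:

#include <stdio.h>
#include <time.h>

static inline void SetBitC(long *array, int bit) {
    *array |= 1 << bit;   // pure C version, as in the question
}

int main(void) {
    struct timespec t0, t1;
    unsigned long i, c = 0;
    long j;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 1; i < (1UL << 28); i++) {
        j = (long)i;
        SetBitC(&j, 10);              // swap in SetBitASM to time the asm variant
        c += (unsigned long)j;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("Result: %lu in %.3f s\n", c, secs);
    return 0;
}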

Solution

Why does GCC optimize so poorly for such a common operation?

Prelude: Since the late 1980s, the focus of compiler optimization has moved away from microbenchmarks, which measure individual operations, and toward macrobenchmarks, which measure applications whose speed people care about. These days most compiler writers are focused on macrobenchmarks, and developing good benchmark suites is taken seriously.

Answer: Nobody working on gcc is using a benchmark where the difference between or and bts matters to the execution time of a real program. If you can produce such a program, you might be able to get the attention of people in gcc-land.

Am I doing something wrong with the C version?

No, this is perfectly good standard C. Very readable and idiomatic, in fact.
