为什么__builtin_popcount比我自己的位计数功能慢? [英] Why is __builtin_popcount slower than my own bit counting function?

查看:131
本文介绍了为什么__builtin_popcount比我自己的位计数功能慢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在编写了自己的位计数例程后,我偶然发现了__builtin_popcount for gcc.但是,当我切换到__builtin_popcount时,我的软件实际上运行速度较慢.我正在使用2.90GHz英特尔酷睿i3-4130T CPU上的Unbutu.我进行了性能测试,以了解效果如何.看起来像这样:

I stumbled on __builtin_popcount for gcc after I had written my own bit count routines. But when I switched to __builtin_popcount my software actually ran slower. I'm on Unbutu on an Intel Core i3-4130T CPU @ 2.90GHz. I built a performance test to see what gives. It looks like this:

#include <iostream>
#include <sys/time.h>
#include <stdint.h>

using namespace std;

const int bitCount[256] = {
    0,1,1,2,1,2,2,3,  1,2,2,3,2,3,3,4,  1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,  2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,  2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,  3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,  2,3,3,4,3,4,4,5,  2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,  3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,  3,4,4,5,4,5,5,6,  3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,  4,5,5,6,5,6,6,7,  4,5,5,6,5,6,6,7,  5,6,6,7,6,7,7,8
};

const uint32_t m32_0001 = 0x000000ffu;
const uint32_t m32_0010 = 0x0000ff00u;
const uint32_t m32_0100 = 0x00ff0000u;
const uint32_t m32_1000 = 0xff000000u;

inline int countBits(uint32_t bitField)
{
    return
        bitCount[(bitField & m32_0001)      ] +
        bitCount[(bitField & m32_0010) >>  8] +
        bitCount[(bitField & m32_0100) >> 16] +
        bitCount[(bitField & m32_1000) >> 24];
}

inline long long currentTime() {
    struct timeval ct;
    gettimeofday(&ct, NULL);
    return ct.tv_sec * 1000000LL + ct.tv_usec;
}

int main() {
    long long start, delta, sum;

    start = currentTime();
    sum = 0;
    for(unsigned i = 0; i < 100000000; ++i)
        sum += countBits(i);
    delta = currentTime() - start;
    cout << "countBits         : sum=" << sum << ": time (usec)=" << delta << endl;

    start = currentTime();
    sum = 0;
    for(unsigned i = 0; i < 100000000; ++i)
        sum += __builtin_popcount(i);
    delta = currentTime() - start;
    cout << "__builtin_popcount: sum=" << sum << ": time (usec)=" << delta << endl;

    start = currentTime();
    sum = 0;
    for(unsigned i = 0; i < 100000000; ++i) {
        int count;
        asm("popcnt %1,%0" : "=r"(count) : "rm"(i) : "cc");
        sum += count;
    }
    delta = currentTime() - start;
    cout << "assembler         : sum=" << sum << ": time (usec)=" << delta << endl;

    return 0;
}

起初,我是使用较旧的编译器运行的:

At first I ran this with an older compiler:

> g++ --version | head -1
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
> cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits         : sum=1314447104: time (usec)=148506
__builtin_popcount: sum=1314447104: time (usec)=345122
assembler         : sum=1314447104: time (usec)=138036

如您所见,基于表的countBits几乎与汇编程序一样快,并且比__builtin_popcount快得多.然后,我在另一台机器类型(相同的处理器上,并且我认为主板也相同)上尝试了一种较新的编译器:

As you can see, the table-based countBits is almost as fast as the assembler and far-faster than __builtin_popcount. Then I tried a newer compiler on a different machine type (same processor -- and I think the mother board's the same too):

> g++ --version | head -1
g++ (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> cat /proc/cpuinfo | grep 'model name' | head -1
model name      : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits         : sum=1314447104: time (usec)=164247
__builtin_popcount: sum=1314447104: time (usec)=345167
assembler         : sum=1314447104: time (usec)=138028

奇怪的是,较旧的编译器比较新的编译器对我的countBits函数进行了更好的优化,但与汇编程序相比仍具有优势.显然,由于汇编程序行可以编译和运行,因此我的处理器支持popcount,但是为什么__builtin_popcount慢两倍以上?我自己的例程又如何与基于硅的弹出式表盘竞争?我在寻找第一个置位位的其他例程方面也有相同的经验.我的例程都比GNU内置"等效例程快得多.

Curiously, the older compiler optimized my countBits function better than the newer compiler, but it still compares favorably with the assembler. Clearly since the assembler line compiles and runs, my processor supports popcount, but why then is __builtin_popcount more than two times slower? And how can my own routine possibly compete with the silicon-based popcount? I'm having the same experience with other routines for finding the first set bit, etc. My routines are all significantly faster than the GNU "builtin" equivalents.

(顺便说一句,我不知道如何编写汇编程序.我只是在某些网页上发现了该行,而且它似乎奇迹般地起作用了.)

(BTW, I have no clue how to write assembler. I just found that line on some web page and it miraculously seemed to work.)

推荐答案

如果不在命令行上指定适当的"-march",则gcc会生成对 __ popcountdi2 函数而不是的调用popcnt 指令.请参阅: https://godbolt.org/z/z1BihM

Without specifying an appropriate "-march" on the command line gcc generates a call to the __popcountdi2 function rather than the popcnt instruction. See: https://godbolt.org/z/z1BihM

根据维基百科,自Nehalem以来,Intel就支持POPCNT,而自巴塞罗那以来,AMD就支持POPCNT: https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT

POPCNT is supported by Intel since Nehalem and AMD since Barcelona according to wikipedia: https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT

这篇关于为什么__builtin_popcount比我自己的位计数功能慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆