为什么__builtin_popcount比我自己的位计数功能慢? [英] Why is __builtin_popcount slower than my own bit counting function?
问题描述
在编写了自己的位计数例程后,我偶然发现了__builtin_popcount for gcc.但是,当我切换到__builtin_popcount时,我的软件实际上运行速度较慢.我正在使用2.90GHz英特尔酷睿i3-4130T CPU上的Unbutu.我进行了性能测试,以了解效果如何.看起来像这样:
I stumbled on __builtin_popcount for gcc after I had written my own bit count routines. But when I switched to __builtin_popcount my software actually ran slower. I'm on Unbutu on an Intel Core i3-4130T CPU @ 2.90GHz. I built a performance test to see what gives. It looks like this:
#include <iostream>
#include <sys/time.h>
#include <stdint.h>
using namespace std;
const int bitCount[256] = {
0,1,1,2,1,2,2,3, 1,2,2,3,2,3,3,4, 1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4, 2,3,3,4,3,4,4,5, 2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5, 3,4,4,5,4,5,5,6, 3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6, 4,5,5,6,5,6,6,7, 4,5,5,6,5,6,6,7, 5,6,6,7,6,7,7,8
};
const uint32_t m32_0001 = 0x000000ffu;
const uint32_t m32_0010 = 0x0000ff00u;
const uint32_t m32_0100 = 0x00ff0000u;
const uint32_t m32_1000 = 0xff000000u;
inline int countBits(uint32_t bitField)
{
return
bitCount[(bitField & m32_0001) ] +
bitCount[(bitField & m32_0010) >> 8] +
bitCount[(bitField & m32_0100) >> 16] +
bitCount[(bitField & m32_1000) >> 24];
}
inline long long currentTime() {
struct timeval ct;
gettimeofday(&ct, NULL);
return ct.tv_sec * 1000000LL + ct.tv_usec;
}
int main() {
long long start, delta, sum;
start = currentTime();
sum = 0;
for(unsigned i = 0; i < 100000000; ++i)
sum += countBits(i);
delta = currentTime() - start;
cout << "countBits : sum=" << sum << ": time (usec)=" << delta << endl;
start = currentTime();
sum = 0;
for(unsigned i = 0; i < 100000000; ++i)
sum += __builtin_popcount(i);
delta = currentTime() - start;
cout << "__builtin_popcount: sum=" << sum << ": time (usec)=" << delta << endl;
start = currentTime();
sum = 0;
for(unsigned i = 0; i < 100000000; ++i) {
int count;
asm("popcnt %1,%0" : "=r"(count) : "rm"(i) : "cc");
sum += count;
}
delta = currentTime() - start;
cout << "assembler : sum=" << sum << ": time (usec)=" << delta << endl;
return 0;
}
起初,我是使用较旧的编译器运行的:
At first I ran this with an older compiler:
> g++ --version | head -1
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
> cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits : sum=1314447104: time (usec)=148506
__builtin_popcount: sum=1314447104: time (usec)=345122
assembler : sum=1314447104: time (usec)=138036
如您所见,基于表的countBits几乎与汇编程序一样快,并且比__builtin_popcount快得多.然后,我在另一台机器类型(相同的处理器上,并且我认为主板也相同)上尝试了一种较新的编译器:
As you can see, the table-based countBits is almost as fast as the assembler and far-faster than __builtin_popcount. Then I tried a newer compiler on a different machine type (same processor -- and I think the mother board's the same too):
> g++ --version | head -1
g++ (Ubuntu 7.3.0-16ubuntu3) 7.3.0
> cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> g++ -O3 popcountTest.cpp
> ./a.out
countBits : sum=1314447104: time (usec)=164247
__builtin_popcount: sum=1314447104: time (usec)=345167
assembler : sum=1314447104: time (usec)=138028
奇怪的是,较旧的编译器比较新的编译器对我的countBits函数进行了更好的优化,但与汇编程序相比仍具有优势.显然,由于汇编程序行可以编译和运行,因此我的处理器支持popcount,但是为什么__builtin_popcount慢两倍以上?我自己的例程又如何与基于硅的弹出式表盘竞争?我在寻找第一个置位位的其他例程方面也有相同的经验.我的例程都比GNU内置"等效例程快得多.
Curiously, the older compiler optimized my countBits function better than the newer compiler, but it still compares favorably with the assembler. Clearly since the assembler line compiles and runs, my processor supports popcount, but why then is __builtin_popcount more than two times slower? And how can my own routine possibly compete with the silicon-based popcount? I'm having the same experience with other routines for finding the first set bit, etc. My routines are all significantly faster than the GNU "builtin" equivalents.
(顺便说一句,我不知道如何编写汇编程序.我只是在某些网页上发现了该行,而且它似乎奇迹般地起作用了.)
(BTW, I have no clue how to write assembler. I just found that line on some web page and it miraculously seemed to work.)
推荐答案
如果不在命令行上指定适当的"-march",则gcc会生成对 __ popcountdi2
函数而不是的调用popcnt
指令.请参阅: https://godbolt.org/z/z1BihM
Without specifying an appropriate "-march" on the command line gcc generates a call to the __popcountdi2
function rather than the popcnt
instruction. See: https://godbolt.org/z/z1BihM
根据维基百科,自Nehalem以来,Intel就支持POPCNT,而自巴塞罗那以来,AMD就支持POPCNT: https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT
POPCNT is supported by Intel since Nehalem and AMD since Barcelona according to wikipedia: https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT
这篇关于为什么__builtin_popcount比我自己的位计数功能慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!