在计算中避免使用分行的bool [英] Using bools in calculations to avoid branches
问题描述
下面是我想出了一个小微优化的好奇心:
结构定时器{
布尔运行{FALSE};
INT蜱{0}; 无效step_versionOne(INT mStepSize){
如果(运行)蜱+ = mStepSize;
} 无效step_versionTwo(INT mStepSize){
蜱+ = mStepSize *的static_cast< INT>(运行);
}
};
看来这两种方法几乎做同样的事情。请问第二个版本避免分支(因此,比第一版本快),或者是能够做到这种优化与 -O3
?
是的,你的绝招允许避免分支,它使我们更快......有时。
我写了基准测试,比较在不同情况下这些解决方案,用我自己的一起:
蜱+ = mStepSize&安培; -static_cast&所述; INT>(运行)
我的结果如下:
关:
分支:399949150
MUL:399940271
andneg:277546678
上:
分支:204035423
MUL:399937142
andneg:277581853
模式:
分支:327724860
MUL:400010363
andneg:277551446
随机:
分支:915235440
MUL:399916440
andneg:277537411
关闭
当定时器被关闭。在这种情况下,解决方案将大约在同一时间。
在
是当他们打开。分支解决方案快两倍。
模式
是当他们在100110格局。性能类似,但分支是有点快。
随机
当分支未predictable。在这种情况下,乘法是快2倍以上。
在所有情况下,我有点黑客伎俩是最快的,除了在
,其中分支胜。
请注意,这个基准是不可重presentative所有的编译器版本的处理器等。即使是很小的基准可以把结果倒置(例如,如果编译器可以内联知道的变化 mStepSize
是 1
比乘法实际上可以最快)。
基准的code:
#包括LT&;阵列GT;
#包括LT&;&iostream的GT;
#包括LT&;&计时GT;结构定时器{
布尔运行{FALSE};
INT蜱{0}; 无效分支(INT mStepSize){
如果(运行)蜱+ = mStepSize;
} 无效MUL(INT mStepSize){
蜱+ = mStepSize *的static_cast< INT>(运行);
} 无效andneg(INT mStepSize){
蜱+ = mStepSize&安培; -static_cast&所述; INT>(运行);
}
};无效的run(的std ::阵列<定时器,256>&安培;定时器,INT步骤){
自动启动=的std ::时辰:: steady_clock ::现在();
的for(int i = 0; I< 1000000;我++)
为(自动& T公司:计时器)
t.branch(步骤);
自动结束=的std ::时辰:: steady_clock ::现在();
性病::法院LT&;< 分支:<< (结束 - 开始).Count之间的()<<的std :: ENDL;
开始=的std ::时辰:: steady_clock ::现在();
的for(int i = 0; I< 1000000;我++)
为(自动& T公司:计时器)
t.mul(步骤);
最终=的std ::时辰:: steady_clock ::现在();
性病::法院LT&;< MUL:&所述;&下; (结束 - 开始).Count之间的()<<的std :: ENDL;
开始=的std ::时辰:: steady_clock ::现在();
的for(int i = 0; I< 1000000;我++)
为(自动& T公司:计时器)
t.andneg(步骤);
最终=的std ::时辰:: steady_clock ::现在();
性病::法院LT&;< andneg:&所述;&下; (结束 - 开始).Count之间的()<<的std :: ENDL;
}诠释主(){
的std ::阵列<定时器,256>定时器;
INT步=兰特()%256; 运行(定时器,步骤); // 暖身
性病::法院LT&;< 关:\\ n;
运行(定时器,步骤);
为(自动& T公司:计时器)
t.running = TRUE;
性病::法院LT&;< 在:\\ n;
运行(定时器,步骤);
的std ::阵列<布尔,6'图案= {1,0,0,1,1,0};
的for(int i = 0; I< 256;我++)
计时器[I] .running =模式[我%6]。
性病::法院LT&;< 模式:\\ N的;
运行(定时器,步骤);
为(自动& T公司:计时器)
t.running = RAND()及1;
性病::法院LT&;< 随机:\\ n;
运行(定时器,步骤);
为(自动& T公司:计时器)
性病::法院LT&;< t.ticks所述&;&下; '';
返回0;
}
Here's a little micro-optimization curiosity that I came up with:
struct Timer {
bool running{false};
int ticks{0};
void step_versionOne(int mStepSize) {
if(running) ticks += mStepSize;
}
void step_versionTwo(int mStepSize) {
ticks += mStepSize * static_cast<int>(running);
}
};
It seems the two methods practically do the same thing. Does the second version avoid a branch (and consequently, is faster than the first version), or is any compiler able to do this kind of optimization with -O3
?
Yes, your trick allows to avoid branch and it makes it faster... sometimes.
I wrote benchmark that compares these solutions in various situations, along with my own:
ticks += mStepSize & -static_cast<int>(running)
My results are following:
Off:
branch: 399949150
mul: 399940271
andneg: 277546678
On:
branch: 204035423
mul: 399937142
andneg: 277581853
Pattern:
branch: 327724860
mul: 400010363
andneg: 277551446
Random:
branch: 915235440
mul: 399916440
andneg: 277537411
Off
is when timers are turned off. In this cases solutions take about the same time.
On
is when they are turned on. Branching solution two times faster.
Pattern
is when they are in 100110 pattern. Performance is similar, but branching is a bit faster.
Random
is when branch is unpredictable. In this case multiplications is more than 2 times faster.
In all cases my bit-hacking trick is fastest, except for On
where branching wins.
Note that this benchmark is not necessarily representative for all compiler versions processors etc. Even small changes of benchmark can turn results upside down (for example if compiler can inline knowing mStepSize
is 1
than multiplication can be actually fastest).
Code of the benchmark:
#include<array>
#include<iostream>
#include<chrono>
struct Timer {
bool running{false};
int ticks{0};
void branch(int mStepSize) {
if(running) ticks += mStepSize;
}
void mul(int mStepSize) {
ticks += mStepSize * static_cast<int>(running);
}
void andneg(int mStepSize) {
ticks += mStepSize & -static_cast<int>(running);
}
};
void run(std::array<Timer, 256>& timers, int step) {
auto start = std::chrono::steady_clock::now();
for(int i = 0; i < 1000000; i++)
for(auto& t : timers)
t.branch(step);
auto end = std::chrono::steady_clock::now();
std::cout << "branch: " << (end - start).count() << std::endl;
start = std::chrono::steady_clock::now();
for(int i = 0; i < 1000000; i++)
for(auto& t : timers)
t.mul(step);
end = std::chrono::steady_clock::now();
std::cout << "mul: " << (end - start).count() << std::endl;
start = std::chrono::steady_clock::now();
for(int i = 0; i < 1000000; i++)
for(auto& t : timers)
t.andneg(step);
end = std::chrono::steady_clock::now();
std::cout << "andneg: " << (end - start).count() << std::endl;
}
int main() {
std::array<Timer, 256> timers;
int step = rand() % 256;
run(timers, step); // warm up
std::cout << "Off:\n";
run(timers, step);
for(auto& t : timers)
t.running = true;
std::cout << "On:\n";
run(timers, step);
std::array<bool, 6> pattern = {1, 0, 0, 1, 1, 0};
for(int i = 0; i < 256; i++)
timers[i].running = pattern[i % 6];
std::cout << "Pattern:\n";
run(timers, step);
for(auto& t : timers)
t.running = rand()&1;
std::cout << "Random:\n";
run(timers, step);
for(auto& t : timers)
std::cout << t.ticks << ' ';
return 0;
}
这篇关于在计算中避免使用分行的bool的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!