在计算中避免使用分行的bool [英] Using bools in calculations to avoid branches

查看:155
本文介绍了在计算中避免使用分行的bool的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

下面是我想出了一个小微优化的好奇心:

 结构定时器{
    布尔运行{FALSE};
    INT蜱{0};    无效step_versionOne(INT mStepSize){
        如果(运行)蜱+ = mStepSize;
    }    无效step_versionTwo(INT mStepSize){
        蜱+ = mStepSize *的static_cast< INT>(运行);
    }
};

看来这两种方法几乎做同样的事情。请问第二个版本避免分支(因此,比第一版本快),或者是能够做到这种优化与 -O3

解决方案

是的,你的绝招允许避免分支,它使我们更快......有时。

我写了基准测试,比较在不同情况下这些解决方案,用我自己的一起:

 蜱+ = mStepSize&安培; -static_cast&所述; INT>(运行)

我的结果如下:

 关:
 分支:399949150
 MUL:399940271
 andneg:277546678
上:
 分支:204035423
 MUL:399937142
 andneg:277581853
模式:
 分支:327724860
 MUL:400010363
 andneg:277551446
随机:
 分支:915235440
 MUL:399916440
 andneg:277537411

关闭当定时器被关闭。在这种情况下,解决方案将大约在同一时间。

是当他们打开。分支解决方案快两倍。

模式是当他们在100110格局。性能类似,但分支是有点快。

随机当分支未predictable。在这种情况下,乘法是快2倍以上。

在所有情况下,我有点黑客伎俩是最快的,除了,其中分支胜。

请注意,这个基准是不可重presentative所有的编译器版本的处理器等。即使是很小的基准可以把结果倒置(例如,如果编译器可以内联知道的变化 mStepSize 1 比乘法实际上可以最快)。

基准的code:

 #包括LT&;阵列GT;
#包括LT&;&iostream的GT;
#包括LT&;&计时GT;结构定时器{
    布尔运行{FALSE};
    INT蜱{0};    无效分支(INT mStepSize){
        如果(运行)蜱+ = mStepSize;
    }    无效MUL(INT mStepSize){
        蜱+ = mStepSize *的static_cast< INT>(运行);
    }    无效andneg(INT mStepSize){
        蜱+ = mStepSize&安培; -static_cast&所述; INT>(运行);
    }
};无效的run(的std ::阵列<定时器,256>&安培;定时器,INT步骤){
    自动启动=的std ::时辰:: steady_clock ::现在();
    的for(int i = 0; I< 1000000;我++)
        为(自动& T公司:计时器)
            t.branch(步骤);
    自动结束=的std ::时辰:: steady_clock ::现在();
    性病::法院LT&;< 分支:<< (结束 - 开始).Count之间的()<<的std :: ENDL;
    开始=的std ::时辰:: steady_clock ::现在();
    的for(int i = 0; I< 1000000;我++)
        为(自动& T公司:计时器)
            t.mul(步骤);
    最终=的std ::时辰:: steady_clock ::现在();
    性病::法院LT&;< MUL:&所述;&下; (结束 - 开始).Count之间的()<<的std :: ENDL;
    开始=的std ::时辰:: steady_clock ::现在();
    的for(int i = 0; I< 1000000;我++)
        为(自动& T公司:计时器)
            t.andneg(步骤);
    最终=的std ::时辰:: steady_clock ::现在();
    性病::法院LT&;< andneg:&所述;&下; (结束 - 开始).Count之间的()<<的std :: ENDL;
}诠释主(){
    的std ::阵列<定时器,256>定时器;
    INT步=兰特()%256;    运行(定时器,步骤); // 暖身
    性病::法院LT&;< 关:\\ n;
    运行(定时器,步骤);
    为(自动& T公司:计时器)
        t.running = TRUE;
    性病::法院LT&;< 在:\\ n;
    运行(定时器,步骤);
    的std ::阵列<布尔,6'图案= {1,0,0,1,1,0};
    的for(int i = 0; I< 256;我++)
        计时器[I] .running =模式[我%6]。
    性病::法院LT&;< 模式:\\ N的;
    运行(定时器,步骤);
    为(自动& T公司:计时器)
        t.running = RAND()及1;
    性病::法院LT&;< 随机:\\ n;
    运行(定时器,步骤);
    为(自动& T公司:计时器)
        性病::法院LT&;< t.ticks所述&;&下; '';
    返回0;
}

Here's a little micro-optimization curiosity that I came up with:

struct Timer {
    bool running{false};
    int ticks{0};

    void step_versionOne(int mStepSize) {
        if(running) ticks += mStepSize;
    }

    void step_versionTwo(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
};

It seems the two methods practically do the same thing. Does the second version avoid a branch (and consequently, is faster than the first version), or is any compiler able to do this kind of optimization with -O3?

解决方案

Yes, your trick allows to avoid branch and it makes it faster... sometimes.

I wrote benchmark that compares these solutions in various situations, along with my own:

ticks += mStepSize & -static_cast<int>(running)

My results are following:

Off:
 branch: 399949150
 mul:    399940271
 andneg: 277546678
On:
 branch: 204035423
 mul:    399937142
 andneg: 277581853
Pattern:
 branch: 327724860
 mul:    400010363
 andneg: 277551446
Random:
 branch: 915235440
 mul:    399916440
 andneg: 277537411

Off is when timers are turned off. In this cases solutions take about the same time.

On is when they are turned on. Branching solution two times faster.

Pattern is when they are in 100110 pattern. Performance is similar, but branching is a bit faster.

Random is when branch is unpredictable. In this case multiplications is more than 2 times faster.

In all cases my bit-hacking trick is fastest, except for On where branching wins.

Note that this benchmark is not necessarily representative for all compiler versions processors etc. Even small changes of benchmark can turn results upside down (for example if compiler can inline knowing mStepSize is 1 than multiplication can be actually fastest).

Code of the benchmark:

#include<array>
#include<iostream>
#include<chrono>

struct Timer {
    bool running{false};
    int ticks{0};

    void branch(int mStepSize) {
        if(running) ticks += mStepSize;
    }

    void mul(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }

    void andneg(int mStepSize) {
        ticks += mStepSize & -static_cast<int>(running);
    }
};

void run(std::array<Timer, 256>& timers, int step) {
    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.branch(step);
    auto end = std::chrono::steady_clock::now();
    std::cout << "branch: " << (end - start).count() << std::endl;
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.mul(step);
    end = std::chrono::steady_clock::now();
    std::cout << "mul:    " << (end - start).count() << std::endl;
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.andneg(step);
    end = std::chrono::steady_clock::now();
    std::cout << "andneg: " << (end - start).count() << std::endl;
}

int main() {
    std::array<Timer, 256> timers;
    int step = rand() % 256;

    run(timers, step); // warm up
    std::cout << "Off:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = true;
    std::cout << "On:\n";
    run(timers, step);
    std::array<bool, 6> pattern = {1, 0, 0, 1, 1, 0};
    for(int i = 0; i < 256; i++)
        timers[i].running = pattern[i % 6];
    std::cout << "Pattern:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = rand()&1;
    std::cout << "Random:\n";
    run(timers, step);
    for(auto& t : timers)
        std::cout << t.ticks << ' ';
    return 0;
}

这篇关于在计算中避免使用分行的bool的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆