Are branch predictor entries invalidated when a program finishes?

Question

I am trying to understand when branch predictor entries are invalidated.

Here is the experiment I ran:

Code 1:

start_measure_branch_mispred()
while(X times):
 if(something something):
  do_useless()
 endif
endwhile
end_measurement()
store_difference()

So, I run this code a number of times. After the first run, the misprediction rate drops: the branch predictor learns to predict correctly. But if I run the experiment again and again (i.e. by typing ./experiment in the terminal), the first iterations of every execution start with a high misprediction rate. So, on each execution, the branch predictor entries for those conditional branches appear to have been invalidated. I boot with nokaslr and have disabled ASLR. I also run the experiment on an isolated core. I have repeated it several times to make sure this is the real behavior (i.e. not noise).

My question is: does the CPU invalidate branch predictor entries after a program stops executing? If not, what causes this?

The second experiment I ran:

Code 2:

do:
    start_measure_branch_mispred()
    while(X times):
      if(something something):
        do_useless()
      endif
    endwhile
    end_measurement()
    store_difference()
while(cpu core == 1)

In this experiment, I run two processes from two different terminals. The first is pinned to core 1 and repeats the experiment until I stop it (by killing it). Then I run the second process from another terminal and pin it to different cores. Since this process is on a different core, it executes the do-while loop only once. If the second process is pinned to the sibling of core 1 (same physical core), its first iteration predicts almost perfectly. If I pin it to a core that is not a sibling of the first, its first iteration has more mispredictions. This is the expected result, because logical cores on the same physical core share the same branch prediction units (that is my assumption). So the second process benefits from the trained entries, since its branches have the same virtual addresses and map to the same branch predictor entries.

As far as I understand, since the CPU is not done with the first process (the core 1 process running the busy loop), the branch predictor entries are still there and the second process can benefit from them. But in the first experiment, I get high mispredictions again on every new run.
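To check which logical CPUs are siblings on the same physical core, Linux exposes a list per CPU in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list (for example "1,5" or "0-1"). Below is a minimal sketch of reading it; parse_cpu_list and siblings_of are hypothetical helper names and are not part of the experiment code.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: expand a Linux CPU list such as "1,5" or "0-3,8"
// into individual logical CPU ids.
std::vector<int> parse_cpu_list(const std::string& text) {
    std::vector<int> cpus;
    std::stringstream ss(text);
    std::string token;
    while (std::getline(ss, token, ',')) {
        if (token.empty()) continue;
        const auto dash = token.find('-');
        if (dash == std::string::npos) {
            cpus.push_back(std::stoi(token));
        } else {
            const int lo = std::stoi(token.substr(0, dash));
            const int hi = std::stoi(token.substr(dash + 1));
            assert(lo <= hi); // a reversed range is treated as malformed input in this sketch
            for (int c = lo; c <= hi; ++c) cpus.push_back(c);
        }
    }
    return cpus;
}

// Hypothetical helper: logical CPUs that share a physical core with `cpu`
// (the list includes `cpu` itself). Returns an empty vector if the sysfs
// file cannot be read.
std::vector<int> siblings_of(int cpu) {
    std::ifstream in("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                     "/topology/thread_siblings_list");
    std::string line;
    if (!in || !std::getline(in, line)) return {};
    return parse_cpu_list(line);
}
```

Calling siblings_of(1) before pinning the second process would tell you which core id to pass for the "sibling" case and which ids to avoid for the "non-sibling" case.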

EDIT: As another user asked for the code, here it is. You need to download the performance events header code from here

To compile: $(CXX) -std=c++11 -O0 main.cpp -lpthread -o experiment

Code:

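#include "linux-perf-events.h"

#include <pthread.h>   // pthread_setaffinity_np, pthread_self
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <unistd.h>    // sysconf

#include <algorithm>
#include <cerrno>      // EINVAL
#include <climits>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>     // strerror
#include <vector>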

// some array
int arr8[8] = {1,1,0,0,0,1,0,1};

int pin_thread_to_core(int core_id){
    int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (core_id < 0 || core_id >= num_cores)
        return EINVAL; // reject out-of-range core ids instead of falling through
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    // returns 0 on success, or an error number on failure (errno is not set)
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

void measurement(int cpuid, uint64_t howmany, int* branch_misses){

    int retval = pin_thread_to_core(cpuid);
    if(retval){
        printf("Affinity error: %s\n", strerror(retval)); // the error code is the return value, not errno
        return;
    }

    std::vector<int> evts;
    evts.push_back(PERF_COUNT_HW_BRANCH_MISSES); // You might have a different performance event!

    LinuxEvents<PERF_TYPE_HARDWARE> unified(evts, cpuid); // You need to change the constructor in the performance counter so that it will count the events in the given cpuid

    uint64_t *buffer = new uint64_t[howmany + 1];
    uint64_t *buffer_org; // for restoring
    buffer_org = buffer;
    uint64_t howmany_org = howmany; // for restoring

    std::vector<unsigned long long> results;
    results.resize(evts.size());

    do{
        for(size_t trial = 0; trial < 10; trial++) {

            unified.start();
            // the while loop will be executed howmany times
            int res;
            while(howmany){
                res = arr8[howmany & 0x7]; // do the sequence howmany/8 times
                if(res){
                    *buffer++ = res;
                }       
                howmany--;
            }
            unified.end(results);
            // store misses
            branch_misses[trial] = results[0];
            // restore for next iteration
            buffer = buffer_org;
            howmany = howmany_org;
        }
    }while(cpuid == 5); // the core that does busy loop

    // get rid of optimization
    howmany = (howmany + 1) * buffer[3];
    branch_misses[10] = howmany; // last entry is reserved for this dummy operation

    delete[] buffer;

}
void usage(){
    printf("Run with ./experiment X \t where X is the core number\n");
}
int main(int argc, char *argv[]) {
    // as I have 11th core isolated, set affinity to that
    if(argc == 1){
        usage();
        return 1;
    }

    int exp = 16; // howmany

    int results[11];
    int cpuid = atoi(argv[1]); 

    measurement(cpuid, exp, results);

    printf("%d measurements\n", exp);

    printf("Trial\t\t\tBranchMiss\n");
    for (size_t trial = 0; trial < 10; trial++)
    {
        printf("%zu\t\t\t%d\n", trial, results[trial]);
    }
    return 0;
}

If you want to try the first experiment, just run ./experiment 1 twice. Each run executes the same loop as Code 1.

If you want to try the second experiment, open two terminals, run ./experiment X in the first one and ./experiment Y in the second one, where X and Y are CPU ids.

Note that you might not have the same performance event counter. Also note that you might need to change the cpuid checked in the busy loop (5 in the code above).
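To see what the predictor has to learn, the outcome of the if(res) condition can be computed offline. The sketch below mirrors the indexing of the measurement loop (arr8[howmany & 0x7] with howmany counting down); branch_pattern is a hypothetical helper, and 'T'/'F' record whether the condition is true, not how the compiler lays out the actual branch instruction.

```cpp
#include <cstdint>
#include <string>

// Same contents as arr8 in the experiment code.
static const int kArr8[8] = {1, 1, 0, 0, 0, 1, 0, 1};

// Hypothetical helper: one character per loop iteration,
// 'T' if the if(res) condition is true, 'F' otherwise.
std::string branch_pattern(uint64_t howmany) {
    std::string out;
    while (howmany) {
        out.push_back(kArr8[howmany & 0x7] ? 'T' : 'F');
        howmany--;
    }
    return out;
}
```

For howmany = 16, as in main(), this gives TTFTFFFT repeated twice, so each trial presents the predictor with two periods of an 8-long pattern.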

Recommended answer

So, I have conducted more experiments to reduce the effect of noise, either from the code executed between _start and main(), or from syscalls and interrupts that can happen between two program executions; both syscalls and interrupts can corrupt the branch predictor state.

Here is the pseudo-code of the modified experiment:

int main(int arg){ // arg is the iteration
   pin_thread_to_isolated_core()
   for i=0 to arg:
     measurement()
     std::this_thread::sleep_for(std::chrono::milliseconds(1)); // I put this as it is
   endfor
   printresults() // print after all measurements are completed
}

void measurement(){
   initialization()
   for i=0 to 10:
      start_measurement()
      while(X times) // for the results below, X is 32
        a = arr8[an element] //sequence of 8,
        if(a is odd)
           do_sth()
        endif
      endwhile
      end_measurement()
      store_difference()
   endfor
}
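The pseudo-code above could be turned into compilable C++ roughly as follows. This is only a sketch under stated assumptions: read_counter is a stand-in callable for whatever reads the branch-miss counter (the perf-event wrapper is not reproduced here), core pinning is omitted, and the names Counter, measurement and run_experiment are chosen for illustration.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Assumption: any callable returning the current branch-miss count.
using Counter = std::function<uint64_t()>;

static const int kSeq[8] = {1, 1, 0, 0, 0, 1, 0, 1};

// One measurement(): 10 trials, each running the data-dependent branch x times
// and recording the counter difference.
std::vector<uint64_t> measurement(uint64_t x, const Counter& read_counter) {
    std::vector<uint64_t> misses;
    volatile int sink = 0; // keeps the branch body from being optimized away
    for (int trial = 0; trial < 10; ++trial) {
        const uint64_t before = read_counter();
        for (uint64_t n = x; n > 0; --n) {
            if (kSeq[n & 0x7] & 1) { // "if (a is odd)"
                sink = sink + 1;
            }
        }
        misses.push_back(read_counter() - before);
    }
    return misses;
}

// Repeats measurement() `runs` times, optionally sleeping between runs.
// Results are only collected here; print them after all runs have finished.
std::vector<std::vector<uint64_t>> run_experiment(int runs, uint64_t x,
                                                  std::chrono::milliseconds pause,
                                                  const Counter& read_counter) {
    std::vector<std::vector<uint64_t>> all;
    for (int run = 0; run < runs; ++run) {
        all.push_back(measurement(x, read_counter));
        if (pause.count() > 0) std::this_thread::sleep_for(pause);
    }
    return all;
}
```

Passing a pause of 1 ms versus 0 ms corresponds to the two variants compared in the results: with and without the sleep between measurements.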

Results:

For example, with the iteration count set to 3:

Trial           BranchMiss
RUN:1
    0           16
    1           28
    2           3
    3           1
    ....  continues as 1
RUN:2
    0           16   // CPU forgets the sequence
    1           30
    2           2
    3           1
    ....  continues as 1
RUN:3
    0           16
    1           27
    2           4
    3           1
    ....  continues as 1

So, even a one-millisecond sleep can disturb the branch prediction units. Why is that the case? If I don't sleep between the measurements, the CPU predicts correctly, i.e. Run 2 and Run 3 look like this:

RUN:2
    0           1   
    1           1
    ....  continues as 1
RUN:3
    0           1
    1           1
    ....  continues as 1

I believe I have minimized the branch executions between _start and the measurement point. Still, the CPU forgets what it was trained on.
