How does Branch Prediction affect performance in R?


Problem description

Some references:


The only post in the r tag that I found somewhat related to branch prediction was Why sampling matrix row is very slow?

Explanation of the problem:


I was investigating whether processing a sorted array is faster than processing an unsorted one (same as the problem tested in Java and C – first link) to see if branch prediction is affecting R in the same manner.


See the benchmark examples below:

set.seed(128)
# or making a vector with 1e7 elements
myvec <- rnorm(1e8, 128, 128)

myvecsorted <- sort(myvec)

mysumU = 0
mysumS = 0

SvU <- microbenchmark::microbenchmark(
  Unsorted = for (i in 1:length(myvec)) {
    
    if (myvec[i] > 128) {
      mysumU = mysumU + myvec[i]
    }
    
  } ,
  Sorted = for (i in 1:length(myvecsorted)) {
    
    if (myvecsorted[i] > 128) {
      mysumS = mysumS + myvecsorted[i]
    }
    
  } ,
  times = 10)

ggplot2::autoplot(SvU)

  • First, I want to know why the "Sorted" vector is not the fastest all the time, and why the speedup is not of the same magnitude as seen in Java?
  • Second, why does the sorted execution time have higher variation compared to that of the unsorted one?

N.B. My CPU is an i7-6820HQ @ 2.70GHz Skylake, quad-core with hyper-threading.


To investigate the variation part, I did the microbenchmark with the vector of 100 million elements (n=1e8) and repeated the benchmark 100 times (times=100). Here's the associated plot with that benchmark.

Here's my sessionInfo:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstudioapi_0.10      reprex_0.3.0         cli_1.1.0            pkgconfig_2.0.3      evaluate_0.14        rlang_0.4.0         
[7] Rcpp_1.0.2           microbenchmark_1.4-7 ggplot2_3.2.1 

Answer


Interpreter overhead, and just being an interpreter, explains most of the average difference. I don't have an explanation for the higher variance.


R is an interpreted language, not JIT compiled to machine code like Java, or ahead-of-time like C. (I don't know much about R internals, just CPUs and performance, so I'm making a lot of assumptions here.)


The code that's running on the actual CPU hardware is the R interpreter, not exactly your R program.


Control dependencies in the R program (like an if()) become data dependencies in the interpreter. The current thing being executed is just data for the interpreter running on a real CPU.


Different operations in the R program become control dependencies in the interpreter. For example, evaluating myvec[i] then the + operator would probably be done by two different functions in the interpreter. And a separate function for > and for if() statements.


The classic interpreter loop is based around an indirect branch that dispatches from a table of function pointers. Instead of a taken/not-taken choice, the CPU needs a prediction for one of many recently-used target addresses. I don't know if R uses a single indirect branch like that, or if it tries to be fancier, like having the end of each interpreter block dispatch to the next one instead of returning to a main dispatch loop.
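To make that dispatch structure concrete, here is a minimal C sketch of the classic pattern (hypothetical opcodes, not R's actual implementation): the opcode stream is just data, and every step ends in one indirect call through a function-pointer table whose target the CPU must predict.

```c
/* Minimal sketch of a classic interpreter dispatch loop with
 * hypothetical opcodes -- NOT R's actual bytecode. Each step fetches
 * an opcode (data!) and makes one indirect call through a table of
 * function pointers: the branch the indirect predictor must guess. */
enum op { OP_LOAD, OP_GT, OP_ADD, OP_HALT, N_OPS };

struct vm {
    const enum op *prog;   /* the "program": just an array of data */
    const double  *vals;   /* input values consumed by OP_LOAD      */
    double acc, cur;
    int pc, vi, running;
};

typedef void (*handler)(struct vm *);

static void do_load(struct vm *m) { m->cur = m->vals[m->vi++]; }
static void do_gt(struct vm *m)   { if (!(m->cur > 128.0)) m->pc++; } /* false: skip next op */
static void do_add(struct vm *m)  { m->acc += m->cur; }
static void do_halt(struct vm *m) { m->running = 0; }

static const handler table[N_OPS] = { do_load, do_gt, do_add, do_halt };

/* Runs "if (x > 128) acc += x", encoded as LOAD, GT, ADD per element. */
double run(const enum op *prog, const double *vals)
{
    struct vm m = { prog, vals, 0.0, 0.0, 0, 0, 1 };
    while (m.running)
        table[m.prog[m.pc++]](&m);   /* the single indirect branch */
    return m.acc;
}
```

Note how `do_gt` changes `pc` depending on the data: with sorted input the sequence of dispatch targets forms long repeating runs that a history-based predictor can track, while random input makes the target stream much harder to predict.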


Modern Intel CPUs (like Haswell and later) have IT-TAGE (Indirect TAgged GEometric history length) prediction. The taken/not-taken state of previous branches along the path of execution are used as an index into a table of predictions. This mostly solves the interpreter branch-prediction problem, allowing it to do a surprisingly good job, especially when the interpreted code (the R code in your case) does the same thing repeatedly.

  • Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore (2015) - Haswell's ITTAGE is a huge improvement for interpreters, invalidating the previous wisdom that a single indirect branch for interpreter dispatch was a disaster. I don't know what R actually uses; there are tricks that were useful.
  • X86 prefetching optimizations: "computed goto" threaded code has more links.
  • https://comparch.net/2013/06/30/why-tage-is-the-best/
  • https://danluu.com/branch-prediction/ has some links about that at the bottom. Also mentions that AMD has used Perceptron predictors in Bulldozer-family and Zen: like a neural net.


The if() being taken does result in needing to do different operations, so it does actually still make some branching in the R interpreter more or less predictable depending on data. But of course as an interpreter, it's doing much more work at each step than a simple machine-code loop over an array.
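For contrast, here is a C sketch (my own, not from the question) of what the same conditional sum looks like when compiled: one data-dependent compare-and-branch per element is essentially all the work, which is what the original Java/C experiments were exercising.

```c
#include <stddef.h>

/* The benchmark's inner loop as compiled code: one conditional branch
 * per element. Sorted input gives the predictor one long not-taken run
 * followed by one long taken run; random input makes each iteration's
 * branch close to a coin flip. */
double sum_over_threshold(const double *v, size_t n, double thresh)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > thresh)      /* the data-dependent branch */
            sum += v[i];
    }
    return sum;
}
```

Because this loop is only a few instructions long, a mispredict (roughly 15-20 cycles on modern x86) can multiply the per-element cost; under an interpreter doing far more work per element, the same penalty is a much smaller relative hit.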

So, because of the interpreter overhead, the extra branch mispredicts account for a much smaller fraction of the total time.


Of course, both your tests are with the same interpreter on the same hardware. I don't know what kind of CPU you have.


If it's Intel older than Haswell or AMD older than Zen, you might be getting a lot of mispredicts even with the sorted array, unless the pattern is simple enough for an indirect branch history predictor to lock onto. That would bury the difference in more noise.


Since you do see a pretty clear difference, I'm guessing that the CPU doesn't mispredict too much in the sorted case, so there is room for it to get worse in the unsorted case.
