用 Rcpp 优化 R 目标函数变慢,为什么? [英] Optimizing R objective function with Rcpp slower, why?

查看:41
本文介绍了用 Rcpp 优化 R 目标函数变慢,为什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在研究一种贝叶斯方法,该方法需要在每次迭代中对多项式 logit 模型进行多个优化步骤.我正在使用 optim() 来执行这些优化,并使用 R 编写的目标函数.分析显示 optim() 是主要瓶颈.

I am currently working on a Bayesian method that requires multiple steps of optimisation of a multinomial logit model per iteration. I am using optim() to perform those optimisations, and an objective function written in R. A profiling revealed that optim() is the main bottleneck.

在挖掘之后,我发现了这个问题,其中他们建议重新编码Rcpp 的目标函数可以加快这个过程.我遵循了建议并使用 Rcpp 重新编码了我的目标函数,但它最终变慢了(大约慢了两倍!).

After digging around, I found this question in which they suggest that recoding the objective function with Rcpp could speed up the process. I followed the suggestion and recoded my objective function with Rcpp, but it ended up being slower (about two times slower!).

这是我第一次使用 Rcpp(或任何与 C++ 相关的东西),我无法找到一种对代码进行矢量化的方法.知道如何让它更快吗?

This was my first time with Rcpp (or anything related to C++) and I was not able to find a way of vectorising the code. Any idea how to make it faster?

Tl;dr:当前 Rcpp 中函数的实现不如矢量化 R;如何让它更快?

一个可重现的例子:

  1. RRcpp 中定义目标函数:仅截取多项式模型的对数似然
  1. Define objective functions in R and Rcpp: log-likelihood of an intercept only multinomial model

library(Rcpp)
library(microbenchmark)

llmnl_int <- function(beta, Obs, n_cat) {
  n_Obs     <- length(Obs)
  Xint      <- matrix(c(0, beta), byrow = T, ncol = n_cat, nrow = n_Obs)
  ind       <- cbind(c(1:n_Obs), Obs)
  Xby       <- Xint[ind]
  Xint      <- exp(Xint)
  iota      <- c(rep(1, (n_cat)))
  denom     <- log(Xint %*% iota)
  return(sum(Xby - denom))
}

cppFunction('double llmnl_int_C(NumericVector beta, NumericVector Obs, int n_cat) {

    int n_Obs = Obs.size();
    
    NumericVector betas = (beta.size()+1);
    for (int i = 1; i < n_cat; i++) {
        betas[i] = beta[i-1];
    };
    
    NumericVector Xby = (n_Obs);
    NumericMatrix Xint(n_Obs, n_cat);
    NumericVector denom = (n_Obs);
    for (int i = 0; i < Xby.size(); i++) {
        Xint(i,_) = betas;
        Xby[i] = Xint(i,Obs[i]-1.0);
        Xint(i,_) = exp(Xint(i,_));
        denom[i] = log(sum(Xint(i,_)));
    };

    return sum(Xby - denom);
}')

  1. 比较它们的效率:

## Draw sample from a multinomial distribution
set.seed(2020)
mnl_sample <- t(rmultinom(n = 1000,size = 1,prob = c(0.3, 0.4, 0.2, 0.1)))
mnl_sample <- apply(mnl_sample,1,function(r) which(r == 1))

## Benchmarking
microbenchmark("llmml_int" = llmnl_int(beta = c(4,2,1), Obs = mnl_sample, n_cat = 4),
               "llmml_int_C" = llmnl_int_C(beta = c(4,2,1), Obs = mnl_sample, n_cat = 4),
               times = 100)
## Results
# Unit: microseconds
#         expr     min       lq     mean   median       uq     max neval
#    llmnl_int  76.809  78.6615  81.9677  79.7485  82.8495 124.295   100
#  llmnl_int_C 155.405 157.7790 161.7677 159.2200 161.5805 201.655   100

  1. 现在在 optim 中调用它们:
  1. Now calling them in optim:

## Benchmarking with optim
microbenchmark("llmnl_int" = optim(c(4,2,1), llmnl_int, Obs = mnl_sample, n_cat = 4, method = "BFGS", hessian = T, control = list(fnscale = -1)),
               "llmnl_int_C" = optim(c(4,2,1), llmnl_int_C, Obs = mnl_sample, n_cat = 4, method = "BFGS", hessian = T, control = list(fnscale = -1)),
               times = 100)
## Results
# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
#    llmnl_int 12.49163 13.26338 15.74517 14.12413 18.35461 26.58235   100
#  llmnl_int_C 25.57419 25.97413 28.05984 26.34231 30.44012 37.13442   100

我对 R 中的矢量化实现速度更快感到有些惊讶.在 Rcpp 中实现更高效的版本(例如,使用 RcppArmadillo?)可以产生任何收益吗?使用 C++ 优化器在 Rcpp 中重新编码所有内容是否更好?

I was somewhat surprised that the vectorised implementation in R was faster. Implementing a more efficient version in Rcpp (say, with RcppArmadillo?) can produce any gains? Is it a better idea to recode everything in Rcpp using a C++ optimiser?

推荐答案

一般来说,如果您能够使用矢量化函数,您会发现它(几乎)与直接在 Rcpp 中运行您的代码一样快.这是因为 R 中的许多向量化函数(几乎所有 Base R 中的向量化函数)都是用 C、Cpp 或 Fortran 编写的,因此通常没什么好处.

In general if you are able to use vectorized functions, you will find it to be (almost) as fast as running your code directly in Rcpp. This is because many vectorized functions in R (almost all vectorized functions in Base R) are written in C, Cpp or Fortran and as such there is often little to gain.

也就是说,您的 RRcpp 代码都有改进.优化来自仔细研究代码,并删除不必要的步骤(内存分配、求和等).

That said, there are improvements to gain both in your R and Rcpp code. Optimization comes from carefully studying the code, and removing unnecessary steps (memory assignment, sums, etc.).

让我们从 Rcpp 代码优化开始.

Lets start with the Rcpp code optimization.

在您的情况下,主要优化是删除不必要的矩阵和向量计算.代码本质上是

In your case the main optimization is to remove unnecessary matrix and vector calculations. The code is in essence

  1. Shift 测试版
  2. 计算 exp(shift beta) [log-sum-exp] 之和的对数
  3. 使用 Obs 作为偏移 Beta 的索引并对所有概率求和
  4. 减去 log-sum-exp

利用这一观察,我们可以将您的代码减少到 2 个 for 循环.请注意, sum 只是另一个 for 循环(或多或少:for(i = 0; i < max; i++){ sum += x }) 所以避免总和可以进一步加速代码(在大多数情况下,这是不必要的优化!).另外你输入的 Obs 是一个整数向量,我们可以通过使用 IntegerVector 类型来进一步优化代码,以避免将 double 元素转换为integer 值(感谢 Ralf Stubner 的回答).

Using this observation we can reduce your code to 2 for-loops. Note that sum is simply another for-loop (more or less: for(i = 0; i < max; i++){ sum += x }) so avoiding the sums can speed up ones code further (in most situations this is unnecessary optimization!). In addition your input Obs is an integer vector, and we can further optimize the code by using the IntegerVector type to avoid casting the double elements to integer values (Credit to Ralf Stubner's answer).

cppFunction('double llmnl_int_C_v2(NumericVector beta, IntegerVector Obs, int n_cat)
 {

    int n_Obs = Obs.size();

    NumericVector betas = (beta.size()+1);
    //1: shift beta
    for (int i = 1; i < n_cat; i++) {
        betas[i] = beta[i-1];
    };
    //2: Calculate log sum only once:
    double expBetas_log_sum = log(sum(exp(betas)));
    // pre allocate sum
    double ll_sum = 0;
    
    //3: Use n_Obs, to avoid calling Xby.size() every time 
    for (int i = 0; i < n_Obs; i++) {
        ll_sum += betas(Obs[i] - 1.0) ;
    };
    //4: Use that we know denom is the same for all I:
    ll_sum = ll_sum - expBetas_log_sum * n_Obs;
    return ll_sum;
}')

请注意,我已经删除了很多内存分配并删除了 for 循环中不必要的计算.我还使用了 denom 对于所有迭代都是相同的,并且只是相乘以获得最终结果.

Note that I have removed quite a few memory allocations and removed unnecessary calculations in the for-loop. Also i have used that denom is the same for all iterations and simply multiplied for the final result.

我们可以在您的 R 代码中执行类似的优化,从而产生以下功能:

We can perform similar optimizations in your R-code, which results in the below function:

llmnl_int_R_v2 <- function(beta, Obs, n_cat) {
    n_Obs <- length(Obs)
    betas <- c(0, beta)
    #note: denom = log(sum(exp(betas)))
    sum(betas[Obs]) - log(sum(exp(betas))) * n_Obs
}

请注意,函数的复杂性已大大降低,使其他人更容易阅读.为了确保我没有把代码弄乱,让我们检查一下它们是否返回相同的结果:

Note the complexity of the function has been drastically reduced making it simpler for others to read. Just to be sure that I haven't messed up in the code somewhere let's check that they return the same results:

set.seed(2020)
mnl_sample <- t(rmultinom(n = 1000,size = 1,prob = c(0.3, 0.4, 0.2, 0.1)))
mnl_sample <- apply(mnl_sample,1,function(r) which(r == 1))

beta = c(4,2,1)
Obs = mnl_sample 
n_cat = 4
xr <- llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xr2 <- llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc <- llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc2 <- llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
all.equal(c(xr, xr2), c(xc, xc2))
TRUE

嗯,这是一种解脱.

我将使用微基准测试来说明性能.优化后的函数速度很快,所以我会运行函数1e5次以减少垃圾收集器的影响

I'll use microbenchmark to illustrate the performance. The optimized functions are fast, so I'll run the functions 1e5 times to reduce the effect of the garbage collector

microbenchmark("llmml_int_R" = llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               "llmml_int_C" = llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               "llmnl_int_R_v2" = llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               "llmml_int_C_v2" = llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               times = 1e5)
#Output:
#Unit: microseconds
#           expr     min      lq       mean  median      uq        max neval
#    llmml_int_R 202.701 206.801 288.219673 227.601 334.301  57368.902 1e+05
#    llmml_int_C 250.101 252.802 342.190342 272.001 399.251 112459.601 1e+05
# llmnl_int_R_v2   4.800   5.601   8.930027   6.401   9.702   5232.001 1e+05
# llmml_int_C_v2   5.100   5.801   8.834646   6.700  10.101   7154.901 1e+05

这里我们看到了与之前相同的结果.现在,与它们的第一个对应部分相比,新功能大约快 35 倍 (R) 和快 40 倍 (Cpp).有趣的是,优化后的 R 函数仍然比我优化的 Cpp 函数快一点(0.3 毫秒或 4 %).我最好的选择是 Rcpp 包有一些开销,如果删除它,两者将是相同的,或者是 R.

Here we see the same result as before. Now the new functions are roughly 35x faster (R) and 40x faster (Cpp) compared to their first counter-parts. Interestingly enough the optimized R function is still very slightly (0.3 ms or 4 %) faster than my optimized Cpp function. My best bet here is that there is some overhead from the Rcpp package, and if this was removed the two would be identical or the R.

同样,我们可以使用 Optim 检查性能.

Similarly we can check performance using Optim.

microbenchmark("llmnl_int" = optim(beta, llmnl_int, Obs = mnl_sample, 
                                   n_cat = n_cat, method = "BFGS", hessian = F, 
                                   control = list(fnscale = -1)),
               "llmnl_int_C" = optim(beta, llmnl_int_C, Obs = mnl_sample, 
                                     n_cat = n_cat, method = "BFGS", hessian = F, 
                                     control = list(fnscale = -1)),
               "llmnl_int_R_v2" = optim(beta, llmnl_int_R_v2, Obs = mnl_sample, 
                                     n_cat = n_cat, method = "BFGS", hessian = F, 
                                     control = list(fnscale = -1)),
               "llmnl_int_C_v2" = optim(beta, llmnl_int_C_v2, Obs = mnl_sample, 
                                     n_cat = n_cat, method = "BFGS", hessian = F, 
                                     control = list(fnscale = -1)),
               times = 1e3)
#Output:
#Unit: microseconds
#           expr       min        lq      mean    median         uq      max neval
#      llmnl_int 29541.301 53156.801 70304.446 76753.851  83528.101 196415.5  1000
#    llmnl_int_C 36879.501 59981.901 83134.218 92419.551 100208.451 190099.1  1000
# llmnl_int_R_v2   667.802  1253.452  1962.875  1585.101   1984.151  22718.3  1000
# llmnl_int_C_v2   704.401  1248.200  1983.247  1671.151   2033.401  11540.3  1000

结果还是一样.

作为一个简短的结论,值得注意的是,这是一个示例,其中将您的代码转换为 Rcpp 并不值得麻烦.情况并非总是如此,但通常值得再次查看您的函数,以查看您的代码中是否存在执行不必要计算的区域.特别是在使用内置向量化函数的情况下,通常不值得花时间将代码转换为 Rcpp.如果将 for-loops 与无法轻松矢量化的代码一起使用以移除 for 循环,通常会看到很大的改进.

As a short conclusion it is worth noting that this is one example, where converting your code to Rcpp is not really worth the trouble. This is not always the case, but often it is worth taking a second look at your function, to see if there are areas of your code, where unnecessary calculations are performed. Especially in situations where one uses buildin vectorized functions, it is often not worth the time to convert code to Rcpp. More often one can see great improvements if one uses for-loops with code that cant easily be vectorized in order to remove the for-loop.

这篇关于用 Rcpp 优化 R 目标函数变慢,为什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆