为什么标准 R 中值函数比简单的 C++ 替代方法慢这么多? [英] Why is standard R median function so much slower than a simple C++ alternative?

查看:68
本文介绍了为什么标准 R 中值函数比简单的 C++ 替代方法慢这么多?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 C++ 中实现了以下中位数,并通过 RcppR 中使用它:

I made the following implementation of the median in C++ and and used it in R via Rcpp:

// [[Rcpp::export]]
double median2(std::vector<double> x){
  double median;
  size_t size = x.size();
  sort(x.begin(), x.end());
  if (size  % 2 == 0){
      median = (x[size / 2 - 1] + x[size / 2]) / 2.0;
  }
  else {
      median = x[size / 2];
  }
  return median;
}

如果我随后将性能与标准的内置 R 中值函数进行比较,我会通过 microbenchmark

If I subsequently compare the performance with the standard built-in R median function, I get the following results via microbenchmark

> x = rnorm(100)
> microbenchmark(median(x),median2(x))
Unit: microseconds
       expr    min     lq     mean median     uq     max neval
  median(x) 25.469 26.990 34.96888 28.130 29.081 518.126   100
 median2(x)  1.140  1.521  2.47486  1.901  2.281  47.897   100

为什么标准中值函数要慢这么多?这不是我所期望的...

Why is the standard median function so much slower? This isn't what I would expect...

推荐答案

正如@joran 所指出的,您的代码非常专业,一般来说,不太通用的函数、算法等...通常性能更高.看看median.default:

As noted by @joran, your code is very specialized, and generally speaking, less generalized functions, algorithms, etc... are often more performant. Take a look at median.default:

median.default
# function (x, na.rm = FALSE) 
# {
#   if (is.factor(x) || is.data.frame(x)) 
#     stop("need numeric data")
#   if (length(names(x))) 
#     names(x) <- NULL
#   if (na.rm) 
#     x <- x[!is.na(x)]
#   else if (any(is.na(x))) 
#     return(x[FALSE][NA])
#   n <- length(x)
#   if (n == 0L) 
#     return(x[FALSE][NA])
#   half <- (n + 1L)%/%2L
#   if (n%%2L == 1L) 
#     sort(x, partial = half)[half]
#   else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
# }

有几个操作来适应缺失值的可能性,这些肯定会影响函数的整体执行时间.由于您的函数不会复制此行为,因此可以消除大量计算,但因此不会为具有缺失值的向量提供相同的结果:

There are several operations in place to accommodate the possibility of missing values, and these will definitely impact the overall execution time of the function. Since your function does not replicate this behavior it can eliminate a bunch of calculations, but consequently will not provide the same result for vectors with missing values:

median(c(1, 2, NA))
#[1] NA

median2(c(1, 2, NA))
#[1] 2

<小时>

其他一些因素可能没有NA的处理那么大的影响,但值得指出:


A couple of other factors which probably don't have as much of an effect as the handling of NAs, but are worth pointing out:

  • median,连同它使用的一些函数,都是 S3 泛型,所以在方法分派上花费了少量时间
  • median 不仅仅适用于整数和数字向量;它还可以处理 DatePOSIXt 和一些其他类,并正确保留属性:
  • median, along with a handful of the functions it uses, are S3 generics, so there is a small amount of time spent on method dispatch
  • median will work with more than just integer and numeric vectors; it will also handle Date, POSIXt, and probably a bunch of other classes, and preserve attributes correctly:
median(Sys.Date() + 0:4)
#[1] "2016-01-15"

median(Sys.time() + (0:4) * 3600 * 24)
#[1] "2016-01-15 11:14:31 EST"

<小时>

我应该提到,下面的函数 将导致原始向量被排序,因为 NumericVector 是代理对象.如果您想避免这种情况,您可以 Rcpp::clone 输入向量并对克隆进行操作,或者使用您的原始签名(使用 std::vectorcode>),这在从 SEXPstd::vector 的转换中隐含地需要一个副本.


I should mention that the function below will cause the original vector to be sorted since NumericVectors are proxy objects. If you want to avoid this, you can either Rcpp::clone the input vector and operate on the clone, or use your original signature (with a std::vector<double>), which implicitly requires a copy in the conversion from SEXP to std::vector.

另请注意,您可以使用 NumericVector 而不是 std::vector 来节省更多时间:

Also note that you can shave off a little more time by using a NumericVector instead of a std::vector<double>:

#include <Rcpp.h>

// [[Rcpp::export]]
double cpp_med(Rcpp::NumericVector x){
  std::size_t size = x.size();
  std::sort(x.begin(), x.end());
  if (size  % 2 == 0) return (x[size / 2 - 1] + x[size / 2]) / 2.0;
  return x[size / 2];
}

<小时>

microbenchmark::microbenchmark(
  median(x),
  median2(x),
  cpp_med(x),
  times = 200L
)
# Unit: microseconds
#       expr    min      lq      mean  median      uq     max neval
#  median(x) 74.787 81.6485 110.09870 92.5665 129.757 293.810   200
# median2(x)  6.474  7.9665  13.90126 11.0570  14.844 151.817   200
# cpp_med(x)  5.737  7.4285  11.25318  9.0270  13.405  52.184   200

<小时>

Yakk 在上面的评论中提出了一个很好的观点 - 也由 Jerry Coffin 详细阐述 - 关于执行完整排序的低效率.这是使用 std::nth_element 的重写,以更大的向量为基准:


Yakk brought up a great point in the comments above - also elaborated on by Jerry Coffin - about the inefficiency of doing a complete sort. Here's a rewrite using std::nth_element, benchmarked on a much larger vector:

#include <Rcpp.h>

// [[Rcpp::export]]
double cpp_med2(Rcpp::NumericVector xx) {
  Rcpp::NumericVector x = Rcpp::clone(xx);
  std::size_t n = x.size() / 2;
  std::nth_element(x.begin(), x.begin() + n, x.end());

  if (x.size() % 2) return x[n]; 
  return (x[n] + *std::max_element(x.begin(), x.begin() + n)) / 2.;
}

<小时>

set.seed(123)
xx <- rnorm(10e5)

all.equal(cpp_med2(xx), median(xx))
all.equal(median2(xx), median(xx))

microbenchmark::microbenchmark(
  cpp_med2(xx), median2(xx), 
  median(xx), times = 200L
)
# Unit: milliseconds
#         expr      min       lq     mean   median       uq       max neval
# cpp_med2(xx) 10.89060 11.34894 13.15313 12.72861 13.56161  33.92103   200
#  median2(xx) 84.29518 85.47184 88.57361 86.05363 87.70065 228.07301   200
#   median(xx) 46.18976 48.36627 58.77436 49.31659 53.46830 250.66939   200

这篇关于为什么标准 R 中值函数比简单的 C++ 替代方法慢这么多?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆