为什么标准 R 中值函数比简单的 C++ 替代方法慢这么多? [英] Why is standard R median function so much slower than a simple C++ alternative?
问题描述
我在 C++
中实现了以下中位数,并通过 Rcpp
在 R
中使用它:
I made the following implementation of the median in C++
and and used it in R
via Rcpp
:
// [[Rcpp::export]]
double median2(std::vector<double> x){
double median;
size_t size = x.size();
sort(x.begin(), x.end());
if (size % 2 == 0){
median = (x[size / 2 - 1] + x[size / 2]) / 2.0;
}
else {
median = x[size / 2];
}
return median;
}
如果我随后将性能与标准的内置 R 中值函数进行比较,我会通过 microbenchmark
If I subsequently compare the performance with the standard built-in R median function, I get the following results via microbenchmark
> x = rnorm(100)
> microbenchmark(median(x),median2(x))
Unit: microseconds
expr min lq mean median uq max neval
median(x) 25.469 26.990 34.96888 28.130 29.081 518.126 100
median2(x) 1.140 1.521 2.47486 1.901 2.281 47.897 100
为什么标准中值函数要慢这么多?这不是我所期望的...
Why is the standard median function so much slower? This isn't what I would expect...
推荐答案
正如@joran 所指出的,您的代码非常专业,一般来说,不太通用的函数、算法等...通常性能更高.看看median.default
:
As noted by @joran, your code is very specialized, and generally speaking, less generalized functions, algorithms, etc... are often more performant. Take a look at median.default
:
median.default
# function (x, na.rm = FALSE)
# {
# if (is.factor(x) || is.data.frame(x))
# stop("need numeric data")
# if (length(names(x)))
# names(x) <- NULL
# if (na.rm)
# x <- x[!is.na(x)]
# else if (any(is.na(x)))
# return(x[FALSE][NA])
# n <- length(x)
# if (n == 0L)
# return(x[FALSE][NA])
# half <- (n + 1L)%/%2L
# if (n%%2L == 1L)
# sort(x, partial = half)[half]
# else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
# }
有几个操作来适应缺失值的可能性,这些肯定会影响函数的整体执行时间.由于您的函数不会复制此行为,因此可以消除大量计算,但因此不会为具有缺失值的向量提供相同的结果:
There are several operations in place to accommodate the possibility of missing values, and these will definitely impact the overall execution time of the function. Since your function does not replicate this behavior it can eliminate a bunch of calculations, but consequently will not provide the same result for vectors with missing values:
median(c(1, 2, NA))
#[1] NA
median2(c(1, 2, NA))
#[1] 2
<小时>
其他一些因素可能没有和NA
的处理那么大的影响,但值得指出:
A couple of other factors which probably don't have as much of an effect as the handling of NA
s, but are worth pointing out:
median
,连同它使用的一些函数,都是 S3 泛型,所以在方法分派上花费了少量时间median
不仅仅适用于整数和数字向量;它还可以处理Date
、POSIXt
和一些其他类,并正确保留属性:
median
, along with a handful of the functions it uses, are S3 generics, so there is a small amount of time spent on method dispatchmedian
will work with more than just integer and numeric vectors; it will also handleDate
,POSIXt
, and probably a bunch of other classes, and preserve attributes correctly:
median(Sys.Date() + 0:4)
#[1] "2016-01-15"
median(Sys.time() + (0:4) * 3600 * 24)
#[1] "2016-01-15 11:14:31 EST"
<小时>
我应该提到,下面的函数 将导致原始向量被排序,因为 NumericVector
是代理对象.如果您想避免这种情况,您可以 Rcpp::clone
输入向量并对克隆进行操作,或者使用您的原始签名(使用 std::vector
code>),这在从 SEXP
到 std::vector
的转换中隐含地需要一个副本.
I should mention that the function below will cause the original vector to be sorted since NumericVector
s are proxy objects. If you want to avoid this, you can either Rcpp::clone
the input vector and operate on the clone, or use your original signature (with a std::vector<double>
), which implicitly requires a copy in the conversion from SEXP
to std::vector
.
另请注意,您可以使用 NumericVector
而不是 std::vector
来节省更多时间:
Also note that you can shave off a little more time by using a NumericVector
instead of a std::vector<double>
:
#include <Rcpp.h>
// [[Rcpp::export]]
double cpp_med(Rcpp::NumericVector x){
std::size_t size = x.size();
std::sort(x.begin(), x.end());
if (size % 2 == 0) return (x[size / 2 - 1] + x[size / 2]) / 2.0;
return x[size / 2];
}
<小时>
microbenchmark::microbenchmark(
median(x),
median2(x),
cpp_med(x),
times = 200L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# median(x) 74.787 81.6485 110.09870 92.5665 129.757 293.810 200
# median2(x) 6.474 7.9665 13.90126 11.0570 14.844 151.817 200
# cpp_med(x) 5.737 7.4285 11.25318 9.0270 13.405 52.184 200
<小时>
Yakk 在上面的评论中提出了一个很好的观点 - 也由 Jerry Coffin 详细阐述 - 关于执行完整排序的低效率.这是使用 std::nth_element
的重写,以更大的向量为基准:
Yakk brought up a great point in the comments above - also elaborated on by Jerry Coffin - about the inefficiency of doing a complete sort. Here's a rewrite using std::nth_element
, benchmarked on a much larger vector:
#include <Rcpp.h>
// [[Rcpp::export]]
double cpp_med2(Rcpp::NumericVector xx) {
Rcpp::NumericVector x = Rcpp::clone(xx);
std::size_t n = x.size() / 2;
std::nth_element(x.begin(), x.begin() + n, x.end());
if (x.size() % 2) return x[n];
return (x[n] + *std::max_element(x.begin(), x.begin() + n)) / 2.;
}
<小时>
set.seed(123)
xx <- rnorm(10e5)
all.equal(cpp_med2(xx), median(xx))
all.equal(median2(xx), median(xx))
microbenchmark::microbenchmark(
cpp_med2(xx), median2(xx),
median(xx), times = 200L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# cpp_med2(xx) 10.89060 11.34894 13.15313 12.72861 13.56161 33.92103 200
# median2(xx) 84.29518 85.47184 88.57361 86.05363 87.70065 228.07301 200
# median(xx) 46.18976 48.36627 58.77436 49.31659 53.46830 250.66939 200
这篇关于为什么标准 R 中值函数比简单的 C++ 替代方法慢这么多?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!