Rcpp equivalent for rowsum
Question
I am looking for a fast alternative for the R function rowsum
in C++ / Rcpp / Eigen or Armadillo.
The purpose is to get the sum of the elements of a vector a according to a grouping vector b. For example:
> a
[1] 2 2 2 2 2 2 2 2 2 2
> b
[1] 1 1 1 1 1 2 2 2 2 2
> rowsum(a,b)
  [,1]
1   10
2   10
Writing a simple for loop in Rcpp is very slow, but maybe my code was just inefficient.
I also tried calling the R function rowsum from Rcpp; however, rowsum itself is not very fast.
Recommended answer
Here's my attempt at doing this using Rcpp (first time using the package, so do point out my inefficiencies):
library(inline)
library(Rcpp)

rowsum_helper = cxxfunction(signature(x = "numeric", y = "integer"), '
  NumericVector var(x);
  IntegerVector factor(y);

  // One slot per possible group id (0..max); NaN marks "group not seen yet".
  std::vector<double> sum(*std::max_element(factor.begin(), factor.end()) + 1,
                          std::numeric_limits<double>::quiet_NaN());

  for (int i = 0, size = var.size(); i < size; ++i) {
    // NaN != NaN, so this test is true while the slot still holds NaN.
    if (sum[factor[i]] != sum[factor[i]]) sum[factor[i]] = var[i];
    else sum[factor[i]] += var[i];
  }

  return NumericVector(sum.begin(), sum.end());
', plugin = "Rcpp")

rowsum_fast = function(x, y) {
  res = rowsum_helper(x, y)
  # Slots that are still NaN belong to group ids that never occurred.
  elements = which(!is.nan(res))
  # Slot j (1-based in R) holds the sum for group id j - 1.
  list(elements - 1, res[elements])
}
It's pretty fast for Martin's example data, but will only work if the factor consists of non-negative integers and will consume memory on the order of the largest integer in the factor vector (one obvious improvement to the above is to subtract min from max to decrease memory usage - which can be done in either the R function or the C++ one).
n = 1e7; x = runif(n); f = sample(n/2, n, T)
system.time(rowsum(x,f))
# user system elapsed
# 14.241 0.170 14.412
system.time({tabulate(f); sum(x)})
# user system elapsed
# 0.216 0.027 0.252
system.time(rowsum_fast(x,f))
# user system elapsed
# 0.313 0.045 0.358
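The min-offset improvement mentioned above can be sketched in plain C++, independent of Rcpp. This is only an illustration under assumptions: the names GroupSums and grouped_sum_offset are hypothetical, and a separate seen flag replaces the NaN sentinel so that NaN values in the data are not mistaken for an untouched slot. Memory then scales with max - min + 1 rather than with max + 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type (not part of Rcpp): dense sums offset by the minimum key.
struct GroupSums {
    int min_key;               // key that corresponds to slot 0
    std::vector<double> sum;   // sum[k - min_key] is the total for key k
    std::vector<bool> seen;    // seen[k - min_key] is true if key k occurred
};

// Hypothetical helper: sums x by integer group id g, using max - min + 1 slots.
GroupSums grouped_sum_offset(const std::vector<double>& x,
                             const std::vector<int>& g) {
    assert(x.size() == g.size());
    GroupSums out{0, {}, {}};
    if (g.empty()) return out;

    const auto mm = std::minmax_element(g.begin(), g.end());
    out.min_key = *mm.first;
    // Compute the slot count in a wider type to avoid int overflow.
    const std::size_t slots = static_cast<std::size_t>(
        static_cast<long long>(*mm.second) - *mm.first) + 1;
    out.sum.assign(slots, 0.0);
    out.seen.assign(slots, false);

    for (std::size_t i = 0; i < x.size(); ++i) {
        const std::size_t slot = static_cast<std::size_t>(
            static_cast<long long>(g[i]) - out.min_key);
        out.sum[slot] += x[i];
        out.seen[slot] = true;
    }
    return out;
}
```

Because the offset is applied inside the function, negative group ids work as well; the cost is one extra pass over g to find the minimum and maximum.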
Also note that a lot of the slowdown (as compared to tabulate) happens in the R code, so if you move that to C++ instead, you should see more improvement:
system.time(rowsum_helper(x,f))
# user system elapsed
# 0.210 0.018 0.228
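As a rough idea of what moving the R-side filtering into C++ could look like, the sketch below does the equivalent of which(!is.nan(res)) on a plain std::vector. The name compact_groups is hypothetical, and the keys it returns are the 0-based slot indices, which match the group ids only under the same assumption as the helper above (non-negative integer ids, NaN as the "unused" marker).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: keep only the slots that were written, i.e. are not NaN.
// Returns (group ids, sums), where the group id is the 0-based slot index.
std::pair<std::vector<int>, std::vector<double> >
compact_groups(const std::vector<double>& dense) {
    std::vector<int> keys;
    std::vector<double> sums;
    for (std::size_t k = 0; k < dense.size(); ++k) {
        if (!std::isnan(dense[k])) {
            keys.push_back(static_cast<int>(k));
            sums.push_back(dense[k]);
        }
    }
    return std::make_pair(keys, sums);
}
```

In an Rcpp setting the two vectors would presumably be returned as a list, which would avoid the extra pass in R.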
Here's a generalization that will handle almost any y, but will be a little bit slower (I'd actually prefer doing this in Rcpp, but don't know how to handle arbitrary R types there):
rowsum_fast = function(x, y) {
  if (is.numeric(y)) {
    y.min = min(y)
    y = y - y.min
    res = rowsum_helper(x, y)
  } else {
    y = as.factor(y)
    res = rowsum_helper(x, as.numeric(y))
  }
  elements = which(!is.nan(res))
  if (is.factor(y)) {
    list(levels(y)[elements-1], res[elements])
  } else {
    list(elements - 1 + y.min, res[elements])
  }
}
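On the C++ side, one possible way to handle arbitrary key types is to template the accumulation on the key and use an ordered map instead of a dense vector. This is a sketch of a different technique, not the approach used above, and grouped_sum_map is a hypothetical name. It trades the O(1) slot lookup for an O(log k) map lookup, but memory scales with the number of distinct groups and the result comes out sorted by key.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: sums x by group key g for any key type with operator<.
template <typename Key>
std::map<Key, double> grouped_sum_map(const std::vector<double>& x,
                                      const std::vector<Key>& g) {
    assert(x.size() == g.size());
    std::map<Key, double> totals;
    for (std::size_t i = 0; i < x.size(); ++i) {
        // operator[] value-initializes a missing entry to 0.0 before the add.
        totals[g[i]] += x[i];
    }
    return totals;
}
```

Called with a std::vector<std::string> of labels, this plays the role of the as.factor branch; how to dispatch on the R type of y from Rcpp is left open here, as in the answer.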