Rcpp equivalent for rowsum

Question

I am looking for a fast alternative for the R function rowsum in C++ / Rcpp / Eigen or Armadillo.

The purpose is to get the sum of elements in a vector a according to a grouping vector b. For example:

> a
 [1] 2 2 2 2 2 2 2 2 2 2    
> b
 [1] 1 1 1 1 1 2 2 2 2 2
> rowsum(a,b)
  [,1]
1   10
2   10

Writing a simple for loop in Rcpp is very slow, but maybe my code was just inefficient.

I also tried calling the function rowsum from Rcpp; however, rowsum itself is not very fast.

Recommended answer

Here's my attempt at doing this using Rcpp (first time using the package, so do point out my inefficiencies):

library(inline)
library(Rcpp)

rowsum_helper = cxxfunction(signature(x = "numeric", y = "integer"), '
  NumericVector var(x);
  IntegerVector factor(y);

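  // One slot per possible group id 0..max(y); NaN marks a group not seen yet.
  std::vector<double> sum(*std::max_element(factor.begin(), factor.end()) + 1,
                          std::numeric_limits<double>::quiet_NaN());
  for (int i = 0, size = var.size(); i < size; ++i) {
    // NaN != NaN, so this test is true for a slot that still holds NaN.
    if (sum[factor[i]] != sum[factor[i]]) sum[factor[i]] = var[i];
    else sum[factor[i]] += var[i];
  }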

  return NumericVector(sum.begin(), sum.end());
', plugin = "Rcpp")

rowsum_fast = function(x, y) {
  res = rowsum_helper(x, y)
  elements = which(!is.nan(res))
  list(elements - 1, res[elements])
}

It's pretty fast for Martin's example data, but will only work if the factor consists of non-negative integers and will consume memory on the order of the largest integer in the factor vector (one obvious improvement to the above is to subtract min from max to decrease memory usage - which can be done in either the R function or the C++ one).
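The min-offset improvement mentioned above could look roughly like the following. This is only a sketch in plain C++ (no Rcpp headers, so it compiles on its own); the name rowsum_offset and the min_key output parameter are made up for illustration and are not part of the original helper. It keeps the same NaN sentinel for groups that never occur.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not from the original answer): dense grouped sum with
// the table shifted by the smallest key, so memory is on the order of
// (max - min + 1) instead of (max + 1).
// Slot k of the result belongs to group key (min_key + k); slots for keys that
// never occur stay NaN, the same sentinel the original helper uses.
std::vector<double> rowsum_offset(const std::vector<double>& x,
                                  const std::vector<int>& g,
                                  int& min_key) {
  assert(x.size() == g.size());
  if (g.empty()) { min_key = 0; return {}; }

  const auto mm = std::minmax_element(g.begin(), g.end());
  min_key = *mm.first;
  // Widen before subtracting so max - min cannot overflow int.
  const std::size_t width =
      static_cast<std::size_t>(static_cast<long long>(*mm.second) - min_key) + 1;

  std::vector<double> sum(width, std::numeric_limits<double>::quiet_NaN());
  for (std::size_t i = 0; i < x.size(); ++i) {
    double& slot =
        sum[static_cast<std::size_t>(static_cast<long long>(g[i]) - min_key)];
    slot = std::isnan(slot) ? x[i] : slot + x[i];
  }
  return sum;
}
```

Because the keys are shifted, negative integers work too. Memory still grows with the spread of the keys, so widely scattered ids remain expensive.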

n = 1e7; x = runif(n); f = sample(n/2, n, replace = TRUE)

system.time(rowsum(x,f))
#    user  system elapsed 
#   14.241  0.170  14.412

system.time({tabulate(f); sum(x)})
#    user  system elapsed 
#   0.216   0.027   0.252

system.time(rowsum_fast(x,f))
#    user  system elapsed 
#   0.313   0.045   0.358

Also note that a lot of the slowdown (as compared to tabulate) happens in the R code, so if you move that to C++ instead, you should see more improvement:

system.time(rowsum_helper(x,f))
#    user  system elapsed 
#   0.210   0.018   0.228
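One way the filtering that rowsum_fast does in R (the which(!is.nan(res)) step) might be moved into C++ is sketched below. It is plain C++ rather than Rcpp so that it stands alone, and GroupSums and rowsum_compact are hypothetical names. Note one deliberate change from the original: it tracks occupied slots with a separate seen flag instead of the NaN sentinel, so a group whose sum is genuinely NaN is not dropped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: parallel vectors of group keys and their sums.
struct GroupSums {
  std::vector<int> keys;
  std::vector<double> sums;
};

// Assumes every key in g is non-negative, as the original helper does.
GroupSums rowsum_compact(const std::vector<double>& x,
                         const std::vector<int>& g) {
  assert(x.size() == g.size());
  GroupSums out;
  if (g.empty()) return out;

  const auto mm = std::minmax_element(g.begin(), g.end());
  assert(*mm.first >= 0);

  std::vector<double> sum(static_cast<std::size_t>(*mm.second) + 1, 0.0);
  std::vector<char> seen(sum.size(), 0);  // set to 1 once a key has occurred

  for (std::size_t i = 0; i < x.size(); ++i) {
    sum[g[i]] += x[i];
    seen[g[i]] = 1;
  }

  // Keep only the keys that actually occurred, in increasing order.
  for (std::size_t k = 0; k < sum.size(); ++k) {
    if (seen[k]) {
      out.keys.push_back(static_cast<int>(k));
      out.sums.push_back(sum[k]);
    }
  }
  return out;
}
```

In an Rcpp version, the two vectors could presumably be wrapped and returned together as a list, which would remove the R-side post-processing entirely.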

Here's a generalization that will handle almost any y, but will be a little bit slower (I'd actually prefer doing this in Rcpp, but don't know how to handle arbitrary R types there):

rowsum_fast = function(x, y) {
  if (is.numeric(y)) {
    y.min = min(y)
    y = y - y.min
    res = rowsum_helper(x, y)
  } else {
    y = as.factor(y)
    res = rowsum_helper(x, as.numeric(y))
  }

  elements = which(!is.nan(res))

  if (is.factor(y)) {
    list(levels(y)[elements-1], res[elements])
  } else {
    list(elements - 1 + y.min, res[elements])
  }
}
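For keys that are not small integers, a hash map is one possible alternative to the dense table. This is a different technique from the one used above, not the original author's method. The sketch below is plain C++ with a made-up name, rowsum_hashed, and works for any key type that std::hash and operator< support, such as int or std::string. How to map arbitrary R types onto such keys from Rcpp is left open here, as it is in the answer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: grouped sum keyed by any hashable, ordered type.
// Memory is proportional to the number of distinct keys rather than to the
// largest key. Results come back sorted by key.
template <typename Key>
std::vector<std::pair<Key, double>> rowsum_hashed(const std::vector<double>& x,
                                                  const std::vector<Key>& g) {
  assert(x.size() == g.size());

  std::unordered_map<Key, double> acc;
  for (std::size_t i = 0; i < x.size(); ++i) {
    acc[g[i]] += x[i];  // operator[] value-initializes a new entry to 0.0
  }

  std::vector<std::pair<Key, double>> out(acc.begin(), acc.end());
  std::sort(out.begin(), out.end(),
            [](const std::pair<Key, double>& a,
               const std::pair<Key, double>& b) { return a.first < b.first; });
  return out;
}
```

The per-element hashing makes this slower than direct indexing when the keys are dense, so it is a trade-off rather than a strict improvement.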
