做平均关系的 Rcpp 等级函数 [英] Rcpp rank function that does average ties

查看:47
本文介绍了做平均关系的 Rcpp 等级函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个从这里偷来的自定义排名函数(经过一些修改):)用于秩函数的 Rcpp 糖

I have a custom rank function that I stole from here (with some modifications) :) Rcpp sugar for rank function

问题是它有最小的关系,而我需要平均关系

The problem is that it does min ties and I need average ties

这是我所拥有的

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector sort_rcpp(NumericVector x) {
  std::vector<double> tmp = Rcpp::as< std::vector<double> > (x);
  std::sort(tmp.begin(), tmp.end());
  return wrap(tmp);
}

// [[Rcpp::export]]
IntegerVector rank_(NumericVector x) {
  return match(x, sort_rcpp(x));
}


/*** R
x <- c(1:5, 1:5)

rank(x, ties = 'average')
rank(x, ties = 'min')
rank_(x)
*/

将其保存到文件然后运行它会得到结果

After saving that to a file then running this gets the results

Rcpp::sourceCpp('~/Documents/Rank.cpp')

哪个返回

# x <- c(1:5, 1:5)
#
# # what I need
# rank(x, ties = 'average')
# [1] 1.5 3.5 5.5 7.5 9.5 1.5 3.5 5.5 7.5 9.5
#
# # What I am getting
# rank(x, ties = 'min')
# [1] 1 3 5 7 9 1 3 5 7 9
#
# rank_(x)
# [1] 1 3 5 7 9 1 3 5 7 9

我需要在 C++ 代码中修改什么以匹配来自基础 R 的平均排名函数?

What do I need to modify in the c++ code to match the average rank function from base R?

推荐答案

这是 shayaa 的链接 -- 主要区别在于使用 Rcpp::seq 而不是 std::iota以及处理 NA 比较的自定义比较器:

This is an adapted version of René Richter's code in shayaa's link -- the main differences being the use of Rcpp::seq instead of std::iota and a custom comparator that handles NA comparisons:

#include <Rcpp.h>

class Comparator {
private:
    const Rcpp::NumericVector& ref;

    bool is_na(double x) const 
    {
        return Rcpp::traits::is_na<REALSXP>(x);    
    }

public:
    Comparator(const Rcpp::NumericVector& ref_)
        : ref(ref_)
    {}

    bool operator()(const int ilhs, const int irhs) const
    {
        double lhs = ref[ilhs], rhs = ref[irhs]; 
        if (is_na(lhs)) return false;
        if (is_na(rhs)) return true;
        return lhs < rhs;
    }
};

// [[Rcpp::export]]
Rcpp::NumericVector avg_rank(Rcpp::NumericVector x)
{
    R_xlen_t sz = x.size();
    Rcpp::IntegerVector w = Rcpp::seq(0, sz - 1);
    std::sort(w.begin(), w.end(), Comparator(x));

    Rcpp::NumericVector r = Rcpp::no_init_vector(sz);
    for (R_xlen_t n, i = 0; i < sz; i += n) {
        n = 1;
        while (i + n < sz && x[w[i]] == x[w[i + n]]) ++n;
        for (R_xlen_t k = 0; k < n; k++) {
            r[w[i + k]] = i + (n + 1) / 2.;
        }
    }

    return r;
}

<小时>

根据base::rank

x <- c(1:7, 1:2, 1:5, 1:10)
all.equal(
    rank(x, ties.method = "average"), 
    avg_rank(x)
)
# [1] TRUE 

另请注意,这可以正确处理 NAs 而您的版本则不能:

Also note that this handles NAs correctly while your version does not:

all.equal(
    rank(c(NA, x), ties.method = "average"), 
    avg_rank(c(NA, x))
)
# [1] TRUE

all.equal(
    rank(c(NA, x), ties.method = "average"), 
    rank_(c(NA, x))
)
# Error: can't subset using a logical vector with NAs

<小时>

这是上面向量 x 的基准:

microbenchmark::microbenchmark(
    ".Internal" = .Internal(rank(x, length(x), ties = "average")),
    avg_rank(x),
    "base::rank" = rank(x, ties.method = "average"),
    rank_(x),
    times = 1000L
)
# Unit: microseconds
#         expr    min     lq      mean median     uq     max neval
#    .Internal  1.283  1.711  2.029777  1.712  2.139  65.002  1000
#  avg_rank(x)  2.566  3.422  4.057262  3.849  4.277  23.521  1000
#   base::rank 13.685 16.251 18.145440 17.534 18.390 163.360  1000
#     rank_(x) 25.659 28.653 31.203092 29.936 32.074 112.898  1000

这是一个具有 1e6 长度向量的基准(我没有包括 rank_,因为即使是一次评估也需要永远;见下文):

And here is a benchmark with a 1e6-length vector (I didn't include rank_ because even a single evaluation was taking forever; see below):

set.seed(123)
xx <- sample(x, 1e6, TRUE)

microbenchmark::microbenchmark(
    ".Internal" = .Internal(rank(xx, length(xx), ties = "average")),
    avg_rank(xx),
    "base::rank" = rank(xx, ties.method = "average"),
    times = 100L
)
# Unit: milliseconds
#          expr      min       lq     mean   median       uq      max neval
#     .Internal 302.2488 309.7977 334.7977 322.0396 347.4779 504.1041   100
#  avg_rank(xx) 304.4435 309.9840 330.4902 316.7807 331.6825 427.7171   100
#    base::rank 312.1196 327.3319 351.6237 343.1317 366.7316 445.9004   100

对于更大尺寸的向量,这三个函数的运行时间更接近.理论上,.Internal 调用应该总是比 base::rank 快一点,因为它放弃了在后者的主体中进行的额外检查.但是,在第二个基准测试中,差异不那么明显,因为执行这些检查所需的时间占函数总运行时间的比例要小得多.

For larger sized vectors, the runtime of these three functions is much closer. In theory, the .Internal call should always be a little faster than base::rank since it foregoes additional checks that take place in the body of the latter. However, the difference is less pronounced in the second benchmark as the amount of time needed to do those checks represents a much smaller proportion of the function's total runtime.

作为旁注,您的代码效率低下的一个明显原因是您正在循环体中创建向量:

As a side note, one of the obvious reasons your code is so inefficient is because you are creating vectors in the body of your loop:

for (int i = 0; i < n; ++i) {
    NumericVector xVal = x[x == x[i]];              // here
    IntegerVector Match = match(xVal, sortX);       // here
    double minM = min(Match);
    int matchSize = Match.size();
    NumericVector aves = NumericVector(matchSize);  // here

    for (int k = 0; k < matchSize; ++k) {
        aves[k] = minM + (k);
    }
    ranks[i] = sum(aves)/Match.size();
}

avg_rank 和 R 的实现(我相信,您可以仔细检查 源代码)只创建了两个额外的向量,而不管输入的大小.您的函数正在创建 2 + 3 * N 个向量 (!!!),其中 N 是输入的大小.

Both avg_rank and R's implementation (I believe, you can double check the source code) are only creating two additional vectors, regardless of the size of the input. Your function is creating 2 + 3 * N vectors (!!!), where N is the size of your input.

最后,与性能没有真正的关系,而不是像这样排序(这将正确处理NA),

And finally, not really related to performance, but instead of sorting like this (which will not handle NAs correctly),

NumericVector sort_rcpp(NumericVector x) {
    std::vector<double> tmp = Rcpp::as< std::vector<double> > (x);
    std::sort(tmp.begin(), tmp.end());
    return wrap(tmp);
} 

只需使用 Rcpp 提供的工具:

just use the tools Rcpp provides:

NumericVector RcppSort(NumericVector x) {
    return Rcpp::clone(x).sort();
}

这篇关于做平均关系的 Rcpp 等级函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆