Rcpp fast statistical mode function with vector input of any type

Question

I'm trying to build a very fast mode function for R to use for aggregating large categorical datasets. The function should accept vector input of every supported R type and return the mode. I have read this post, this help page, and others, but I was not able to make the function accept all R data types. My code currently works for numeric vectors; it relies on Rcpp sugar wrapper functions:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
int Mode(NumericVector x, bool narm = false) 
{
    if (narm) x = x[!is_na(x)];
    NumericVector ux = unique(x);
    int y = ux[which_max(table(match(x, ux)))];
    return y;
}

In addition, I was wondering whether the 'narm' argument can be renamed 'na.rm' without causing errors. And of course, if there is a faster way to code a mode function in C++, I would be grateful to hear about it.

Recommended answer

To make the function work for any vector input, you could implement @JosephWood's algorithm separately for each data type you want to support and call the right version from a switch(TYPEOF(x)). But that would mean a lot of code duplication. It is better to write one generic function that works on any Vector<RTYPE> argument. If we follow R's paradigm that everything is a vector and let the function return a Vector<RTYPE> as well, we can make use of RCPP_RETURN_VECTOR. Note that we need C++11 to be able to pass additional arguments to the function called by RCPP_RETURN_VECTOR. One tricky point is that you need the storage type of Vector<RTYPE> in order to create a suitable std::unordered_map. Here Rcpp::traits::storage_type<RTYPE>::type comes to the rescue. However, std::unordered_map does not know how to deal with complex numbers from R, since no default hash is provided for them. For simplicity, I am disabling this special case.
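To illustrate the "one generic implementation, one dispatch point" idea without depending on Rcpp, here is a minimal sketch in standard C++11. The names TypeTag, AnyVector, modeOf and modeDispatch are hypothetical stand-ins invented for this illustration: TypeTag plays the role of TYPEOF(x), AnyVector the role of an R vector, and the switch in modeDispatch roughly the role that RCPP_RETURN_VECTOR takes over in the actual answer.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for R's type codes (INTSXP, REALSXP, STRSXP, ...).
enum class TypeTag { Integer, Real, String };

// Hypothetical stand-in for an R vector: a tag plus one active payload.
struct AnyVector {
  TypeTag tag = TypeTag::Integer;
  std::vector<int> ints;
  std::vector<double> reals;
  std::vector<std::string> strings;
};

// One generic implementation: count each value in a hash map keyed by the
// element type and remember the first value that reaches the highest count.
template <typename T>
std::vector<T> modeOf(const std::vector<T>& x) {
  std::vector<T> result;
  std::unordered_map<T, int> counts;
  counts.reserve(x.size());
  int best = 0;
  for (const T& value : x) {
    const int n = ++counts[value];
    if (n > best) {
      best = n;
      result.assign(1, value);  // keep a length-one "vector" as the result
    }
  }
  return result;  // empty input yields an empty result
}

// Single dispatch point: the type tag selects the template instantiation,
// so the counting logic is written only once.
AnyVector modeDispatch(const AnyVector& x) {
  AnyVector out;
  out.tag = x.tag;
  switch (x.tag) {
    case TypeTag::Integer: out.ints = modeOf(x.ints); break;
    case TypeTag::Real:    out.reals = modeOf(x.reals); break;
    case TypeTag::String:  out.strings = modeOf(x.strings); break;
  }
  return out;
}
```

Calling modeDispatch on an AnyVector tagged TypeTag::Integer with the values 3, 1, 3, 2, 1, 3 returns a length-one integer payload holding 3. Like the answer's approach, the result has the same type as the input, which is what makes a single dispatch point sufficient.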

Putting it all together:

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>

template <int RTYPE>
Vector<RTYPE> fastModeImpl(Vector<RTYPE> x, bool narm){
  if (narm) x = x[!is_na(x)];
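  int myMax = 1;
  Vector<RTYPE> myMode(1);
  // seed the result with the first element so that input without any
  // repeated value returns that element instead of a default-initialized one
  if (x.size() > 0) myMode[0] = x[0];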
  // special case for factors == INTSXP with "class" and "levels" attribute
  if (x.hasAttribute("levels")){
    myMode.attr("class") = x.attr("class");
    myMode.attr("levels") = x.attr("levels");
  }
  std::unordered_map<typename Rcpp::traits::storage_type<RTYPE>::type, int> modeMap;
  modeMap.reserve(x.size());

  for (std::size_t i = 0, len = x.size(); i < len; ++i) {
    auto it = modeMap.find(x[i]);

    if (it != modeMap.end()) {
      ++(it->second);
      if (it->second > myMax) {
        myMax = it->second;
        myMode[0] = x[i];
      }
    } else {
      modeMap.insert({x[i], 1});
    }
  }

  return myMode;
}

template <>
Vector<CPLXSXP> fastModeImpl(Vector<CPLXSXP> x, bool narm) {
  stop("Not supported SEXP type!");
}

// [[Rcpp::export]]
SEXP fastMode( SEXP x, bool narm = false ){
  RCPP_RETURN_VECTOR(fastModeImpl, x, narm);
}

/*** R
set.seed(1234)
s <- sample(1e5, replace = TRUE)
fastMode(s)
fastMode(s + 0.1)
l <- sample(c(TRUE, FALSE), 11, replace = TRUE) 
fastMode(l)
c <- sample(letters, 1e5, replace = TRUE)
fastMode(c)
f <- as.factor(c)
fastMode(f) 
*/

Output:

> set.seed(1234)

> s <- sample(1e5, replace = TRUE)

> fastMode(s)
[1] 85433

> fastMode(s + 0.1)
[1] 85433.1

> l <- sample(c(TRUE, FALSE), 11, replace = TRUE) 

> fastMode(l)
[1] TRUE

> c <- sample(letters, 1e5, replace = TRUE)

> fastMode(c)
[1] "z"

> f <- as.factor(c)

> fastMode(f) 
[1] z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
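The complex case is disabled in the answer because the hash map has no default hash for complex values. As a rough sketch of how that gap could be closed, the following standard C++11 code supplies a custom hasher. It uses std::complex<double> as a stand-in for R's complex type, and the names ComplexHash and complexMode are hypothetical; this is not part of the original answer and has not been wired into Rcpp here.

```cpp
#include <complex>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical hasher: combines the hashes of the real and imaginary parts.
struct ComplexHash {
  std::size_t operator()(const std::complex<double>& z) const {
    const std::size_t h1 = std::hash<double>()(z.real());
    const std::size_t h2 = std::hash<double>()(z.imag());
    // boost-style hash combination of the two component hashes
    return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
  }
};

// Same counting scheme as for the other types, with the custom hasher
// passed as the third template argument of std::unordered_map.
std::complex<double> complexMode(const std::vector<std::complex<double>>& x) {
  std::unordered_map<std::complex<double>, int, ComplexHash> counts;
  counts.reserve(x.size());
  std::complex<double> best;  // stays 0+0i for empty input
  int bestCount = 0;
  for (const std::complex<double>& z : x) {
    const int n = ++counts[z];
    if (n > bestCount) {
      bestCount = n;
      best = z;
    }
  }
  return best;
}
```

Key equality relies on the operator== that std::complex already provides. One caveat of this sketch: values containing NaN never compare equal to themselves, so missing values would each be counted separately unless they are filtered out first.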

As noted above, the algorithm used comes from Joseph Wood's answer, which has been explicitly dual-licensed under CC-BY-SA and GPL >= 2. I am following Joseph's lead and hereby license the code in this answer under the GPL (version 2 or later) in addition to the implicit CC-BY-SA license.
