Rcpp fast statistical mode function with vector input of any type
Question
I'm trying to build a very fast mode function for R, to be used for aggregating large categorical datasets. The function should take vector input of all supported R types and return the mode. I have read this post, this help page, and others, but I was not able to make the function accept all R data types. My code currently works for numeric vectors, and I am relying on Rcpp sugar wrapper functions:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int Mode(NumericVector x, bool narm = false)
{
  if (narm) x = x[!is_na(x)];
  NumericVector ux = unique(x);
  int y = ux[which_max(table(match(x, ux)))];
  return y;
}
In addition, I was wondering whether the 'narm' argument can be renamed 'na.rm' without causing errors. And of course, if there is a faster way to code a mode function in C++, I would be grateful to know about it.
Recommended answer
To make the function work for any vector input, you could implement @JosephWood's algorithm for each data type you want to support and call it from a switch(TYPEOF(x)). But that would mean a lot of code duplication. Instead, it is better to write a generic function that works on any Vector<RTYPE> argument. If we follow R's paradigm that everything is a vector and let the function also return a Vector<RTYPE>, then we can make use of RCPP_RETURN_VECTOR. Note that we need C++11 to be able to pass additional arguments to the function called by RCPP_RETURN_VECTOR. One tricky thing is that you need the storage type of Vector<RTYPE> in order to create a suitable std::unordered_map. Here Rcpp::traits::storage_type<RTYPE>::type comes to the rescue. However, std::unordered_map does not know how to deal with complex numbers from R. For simplicity, I am disabling this special case.
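The counting idea itself does not depend on Rcpp. As a minimal sketch in standard C++, assuming a plain std::vector<T> in place of Vector<RTYPE> (the name mode_of is hypothetical and not part of Rcpp), a single function template can serve every element type for which std::hash is defined:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Single-pass mode: returns the first value that reaches the highest count.
// Precondition (assumption of this sketch): x is not empty.
template <typename T>
T mode_of(const std::vector<T>& x) {
    assert(!x.empty());
    std::unordered_map<T, int> counts;  // key type plays the role of storage_type
    counts.reserve(x.size());
    T best = x[0];
    int best_count = 0;
    for (const T& v : x) {
        int c = ++counts[v];            // inserts 0 on first sight, then increments
        if (c > best_count) {
            best_count = c;
            best = v;
        }
    }
    return best;
}
```

Calling mode_of on a std::vector<int>, std::vector<double>, or std::vector<std::string> instantiates the template once per type. In the Rcpp version the choice of instantiation is made at run time from the SEXP type, which is the part RCPP_RETURN_VECTOR takes care of.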
Putting it all together:
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>
template <int RTYPE>
Vector<RTYPE> fastModeImpl(Vector<RTYPE> x, bool narm){
  if (narm) x = x[!is_na(x)];
  int myMax = 1;
  Vector<RTYPE> myMode(1);
  // start from the first element so that input without any repeated value
  // returns x[0] instead of a default-initialized result
  if (x.size() > 0) myMode[0] = x[0];
  // special case for factors == INTSXP with "class" and "levels" attribute
  if (x.hasAttribute("levels")){
    myMode.attr("class") = x.attr("class");
    myMode.attr("levels") = x.attr("levels");
  }
  // count occurrences, keyed by the storage type of the vector elements
  std::unordered_map<typename Rcpp::traits::storage_type<RTYPE>::type, int> modeMap;
  modeMap.reserve(x.size());
  for (std::size_t i = 0, len = x.size(); i < len; ++i) {
    auto it = modeMap.find(x[i]);
    if (it != modeMap.end()) {
      ++(it->second);
      if (it->second > myMax) {
        myMax = it->second;
        myMode[0] = x[i];
      }
    } else {
      modeMap.insert({x[i], 1});
    }
  }
  return myMode;
}
// complex vectors are not supported: no std::hash is available for them
template <>
Vector<CPLXSXP> fastModeImpl(Vector<CPLXSXP> x, bool narm) {
  stop("Not supported SEXP type!");
}
// [[Rcpp::export]]
SEXP fastMode( SEXP x, bool narm = false ){
  RCPP_RETURN_VECTOR(fastModeImpl, x, narm);
}
/*** R
set.seed(1234)
s <- sample(1e5, replace = TRUE)
fastMode(s)
fastMode(s + 0.1)
l <- sample(c(TRUE, FALSE), 11, replace = TRUE)
fastMode(l)
c <- sample(letters, 1e5, replace = TRUE)
fastMode(c)
f <- as.factor(c)
fastMode(f)
*/
Output:
> set.seed(1234)
> s <- sample(1e5, replace = TRUE)
> fastMode(s)
[1] 85433
> fastMode(s + 0.1)
[1] 85433.1
> l <- sample(c(TRUE, FALSE), 11, replace = TRUE)
> fastMode(l)
[1] TRUE
> c <- sample(letters, 1e5, replace = TRUE)
> fastMode(c)
[1] "z"
> f <- as.factor(c)
> fastMode(f)
[1] z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
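The complex case is disabled above because the standard library provides no std::hash for complex values. One possible way to lift that restriction is to hand std::unordered_map a custom hash functor. The following is only a sketch in standard C++, with std::complex<double> standing in for R's complex type. The names ComplexHash and complex_mode are hypothetical, the hash-combining formula is a common convention rather than anything Rcpp prescribes, and the sketch has not been wired into the Rcpp code above.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical hash functor: combines the hashes of the real and imaginary parts.
struct ComplexHash {
    std::size_t operator()(const std::complex<double>& z) const {
        std::size_t h1 = std::hash<double>()(z.real());
        std::size_t h2 = std::hash<double>()(z.imag());
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

// Same single-pass counting as for other types, with the custom hash supplied.
// Precondition (assumption of this sketch): x is not empty.
std::complex<double> complex_mode(const std::vector<std::complex<double> >& x) {
    assert(!x.empty());
    std::unordered_map<std::complex<double>, int, ComplexHash> counts;
    counts.reserve(x.size());
    std::complex<double> best = x[0];
    int best_count = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        int c = ++counts[x[i]];
        if (c > best_count) {
            best_count = c;
            best = x[i];
        }
    }
    return best;
}
```

Key equality falls back to operator== of std::complex, so two values count as the same key only when both parts compare equal.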
As noted above, the algorithm used here comes from Joseph Wood's answer, which has been explicitly dual-licensed under CC-BY-SA and GPL >= 2. I am following Joseph and hereby license the code in this answer under the GPL (version 2 or later) in addition to the implicit CC-BY-SA license.