Order a DataFrame by column in Rcpp
Question
Is there an easy way to order a DataFrame by one, two, or more of its columns within Rcpp?
There are many sorting algorithms available on the net, and I could use std::sort with a wrapper for DataFrame, but I was wondering whether something is already available in either Rcpp or RcppArmadillo.
I need to do this sorting / ordering as part of another function:
DataFrame myFunc(DataFrame myDF, NumericVector x) {
    //// some code here
    DataFrame myDFsorted = sort(myDF, someColName1, someColName2); // how to sort??
    //// some code here
}
I would like to avoid calling R's order function from within Rcpp, to retain the speed of the Rcpp code.
Many thanks.
Recommended answer
The difficulty is that a data frame is a set of vectors, potentially of different types, so we need a way to order them independently of these types (integer, character, ...). In dplyr, we have developed what we call vector visitors. For this particular problem, what we need is a set of OrderVisitor objects, which expose the following interface:
class OrderVisitor {
public:
    virtual ~OrderVisitor(){}

    /** are the elements at indices i and j equal */
    virtual bool equal(int i, int j) const = 0 ;

    /** is the i element less than the j element */
    virtual bool before( int i, int j) const = 0 ;

    virtual SEXP get() = 0 ;
} ;
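As a rough illustration of how one implementation of such an interface might look, here is a minimal sketch in plain C++ without Rcpp. The names ColumnOrderVisitor and TypedOrderVisitor are hypothetical and are not dplyr's classes; the sketch drops the SEXP-returning get() and ignores R's NA semantics, which a real implementation would have to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the interface above (no SEXP, no Rcpp).
class ColumnOrderVisitor {
public:
    virtual ~ColumnOrderVisitor() {}
    virtual bool equal(int i, int j) const = 0;
    virtual bool before(int i, int j) const = 0;
};

// One template covers every element type that supports == and <.
template <typename T>
class TypedOrderVisitor : public ColumnOrderVisitor {
public:
    explicit TypedOrderVisitor(const std::vector<T>& column) : column_(column) {}

    bool equal(int i, int j) const override { return at(i) == at(j); }
    bool before(int i, int j) const override { return at(i) < at(j); }

private:
    // Bounds-checked element access (the check is active in debug builds).
    const T& at(int i) const {
        assert(i >= 0 && static_cast<std::size_t>(i) < column_.size());
        return column_[static_cast<std::size_t>(i)];
    }

    std::vector<T> column_;  // owns a copy of the column, so it cannot dangle
};
```

Code that sorts rows only needs the base class, so a TypedOrderVisitor<int> and a TypedOrderVisitor<std::string> can sit side by side in the same container of pointers.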
dplyr then has implementations of OrderVisitor for all the types we support in this file, and we have a dispatcher function, order_visitor, that makes an OrderVisitor* from a vector.
With this, we can store a set of vector visitors in a std::vector<OrderVisitor*>. The OrderVisitors class has a constructor taking a DataFrame and a CharacterVector of the names of the vectors we want to use for the ordering.
OrderVisitors o(data, names ) ;
Then we can use the OrderVisitors.apply method, which essentially does lexicographic ordering:
IntegerVector index = o.apply() ;
The apply method is implemented by simply initializing an IntegerVector with 0..n-1 and then sorting it with std::sort according to the visitors.
inline Rcpp::IntegerVector OrderVisitors::apply() const {
    IntegerVector x = seq(0, nrows -1 ) ;
    std::sort( x.begin(), x.end(), OrderVisitors_Compare(*this) ) ;
    return x ;
}
The relevant thing here is how the OrderVisitors_Compare class implements operator()(int,int):
inline bool operator()(int i, int j) const {
    if( i == j ) return false ;
    for( int k=0; k<n; k++)
        if( ! obj.visitors[k]->equal(i,j) )
            return obj.visitors[k]->before(i, j ) ;
    return i < j ;
}
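The same idea, sorting a vector of row indices with a comparator that walks the key columns in turn, can be sketched in plain C++ without Rcpp or dplyr. The function name lexicographic_order is hypothetical, and the sketch assumes that every key column is numeric and contains no NaN values (a NaN would break the strict weak ordering that std::sort requires).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the row permutation that sorts rows by keys[0], then keys[1], ...
// Each element of `keys` is one column; all columns must have equal length.
std::vector<int> lexicographic_order(const std::vector<std::vector<double>>& keys) {
    const std::size_t nrows = keys.empty() ? 0 : keys[0].size();
    for (const auto& column : keys) {
        assert(column.size() == nrows);
    }

    std::vector<int> index(nrows);
    std::iota(index.begin(), index.end(), 0);  // 0, 1, ..., nrows - 1

    std::sort(index.begin(), index.end(), [&keys](int i, int j) -> bool {
        if (i == j) return false;
        for (const auto& column : keys) {
            // The first column on which the two rows differ decides the order.
            if (column[i] != column[j]) return column[i] < column[j];
        }
        return i < j;  // all keys tie: fall back to the original row order
    });
    return index;
}
```

For the key columns {2, 1, 2, 1} and {0.5, 9, 0.1, 3}, this returns the indices 3, 1, 2, 0. The final `i < j` tie-break makes the result deterministic even though std::sort itself is not a stable sort.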
At this point, index gives us the integer indices of the sorted data, so we just have to make a new DataFrame from data by subsetting it with these indices. For this we have another kind of visitor, encapsulated in the DataFrameVisitors class. We first create a DataFrameVisitors:
DataFrameVisitors visitors( data ) ;
This encapsulates a std::vector<VectorVisitor*>. Each VectorVisitor* knows how to subset itself with an integer vector index. This is used by DataFrameVisitors.subset:
template <typename Container>
DataFrame subset( const Container& index, const CharacterVector& classes ) const {
    List out(nvisitors);
    for( int k=0; k<nvisitors; k++){
        out[k] = get(k)->subset(index) ;
    }
    structure( out, Rf_length(out[0]) , classes) ;
    return (SEXP)out ;
}
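To make the subsetting step concrete, here is a minimal sketch in plain C++ that applies one 0-based index vector to every column of a toy table. The names Table, subset_column and subset_rows are hypothetical; the fixed two-column struct stands in for a data frame and is not how dplyr represents one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Picks the elements of `column` in the order given by `index` (0-based).
template <typename T>
std::vector<T> subset_column(const std::vector<T>& column,
                             const std::vector<int>& index) {
    std::vector<T> out;
    out.reserve(index.size());
    for (int i : index) {
        assert(i >= 0 && static_cast<std::size_t>(i) < column.size());
        out.push_back(column[static_cast<std::size_t>(i)]);
    }
    return out;
}

// Toy stand-in for a data frame: two columns of different types.
struct Table {
    std::vector<int> id;
    std::vector<std::string> name;
};

// Reorders every column with the same index, so rows stay aligned.
Table subset_rows(const Table& table, const std::vector<int>& index) {
    return Table{subset_column(table.id, index),
                 subset_column(table.name, index)};
}
```

Applying the index 2, 0, 1 to a table with ids 10, 20, 30 and names "a", "b", "c" gives ids 30, 10, 20 and names "c", "a", "b". Because every column is reordered by the same index, the rows remain intact.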
To wrap this up, here is a simple function using tools developed in dplyr:
#include <dplyr.h>
// [[Rcpp::depends(dplyr)]]

using namespace Rcpp ;
using namespace dplyr ;

// [[Rcpp::export]]
DataFrame myFunc(DataFrame data, CharacterVector names) {
    OrderVisitors o(data, names ) ;
    IntegerVector index = o.apply() ;

    DataFrameVisitors visitors( data ) ;
    DataFrame res = visitors.subset(index, "data.frame" ) ;
    return res ;
}