Order a data frame by column in Rcpp


Question

Is there an easy way to order a DataFrame by one, two, or more of its columns within Rcpp?

There are many sorting algorithms available online, and I could use std::sort with a wrapper for DataFrame, but I was wondering whether something is already available in either Rcpp or RcppArmadillo.

I need to do this ordering as part of another function:

DataFrame myFunc(DataFrame myDF, NumericVector x) {
  //// some code here
  DataFrame myDFsorted = sort(myDF, someColName1, someColName2); // how to sort??
  //// some code here
}

I would like to avoid calling R's order function from within Rcpp, in order to retain the speed of the Rcpp code.

Many thanks.

Recommended answer

The difficulty is that a data frame is a set of vectors, potentially of different types; we need a way to order them independently of these types (integer, character, ...). In dplyr, we have developed what we call vector visitors. For this particular problem, what we need is a set of OrderVisitor objects, which expose the following interface:

class OrderVisitor {
public:
    virtual ~OrderVisitor(){}

    /** are the elements at indices i and j equal */
    virtual bool equal(int i, int j) const  = 0 ;

    /** is the i element less than the j element */
    virtual bool before( int i, int j) const = 0 ;

    virtual SEXP get() = 0 ;

} ;

dplyr then has implementations of OrderVisitor for all the types it supports, and a dispatcher function, order_visitor, that makes an OrderVisitor* from a vector.
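
To make the idea of one visitor per column type concrete, here is a minimal sketch in plain standard C++ rather than Rcpp. It is not dplyr's implementation: the names TypedOrderVisitor and make_order_visitor are hypothetical, std::vector stands in for an R vector, the R-specific get() member is left out, and the dispatch happens at compile time through a template, whereas a real dispatcher would have to inspect the vector's type at run time.

```cpp
#include <memory>
#include <string>
#include <vector>

// The interface from the answer, minus the R-specific get() member.
class OrderVisitor {
public:
    virtual ~OrderVisitor() {}
    virtual bool equal(int i, int j) const = 0;
    virtual bool before(int i, int j) const = 0;
};

// Hypothetical typed implementation: one template covers every element
// type that supports == and <.
template <typename T>
class TypedOrderVisitor : public OrderVisitor {
public:
    explicit TypedOrderVisitor(const std::vector<T>& column) : column_(column) {}

    bool equal(int i, int j) const override { return column_[i] == column_[j]; }
    bool before(int i, int j) const override { return column_[i] < column_[j]; }

private:
    const std::vector<T>& column_;  // not owned; must outlive the visitor
};

// Hypothetical stand-in for the order_visitor dispatcher: it hides the
// element type behind the OrderVisitor interface.
template <typename T>
std::unique_ptr<OrderVisitor> make_order_visitor(const std::vector<T>& column) {
    return std::unique_ptr<OrderVisitor>(new TypedOrderVisitor<T>(column));
}
```

Once a column is wrapped this way, the sorting code only sees equal and before, so integer and string columns can be mixed in the same ordering.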

With this, we can store a set of vector visitors in a std::vector<OrderVisitor*>. The OrderVisitors class has a constructor taking a DataFrame and a CharacterVector of the names of the vectors we want to use for the ordering.

OrderVisitors o(data, names ) ;

Then we can use the OrderVisitors::apply method, which essentially does lexicographic ordering:

IntegerVector index = o.apply() ;

The apply method is implemented by simply initializing an IntegerVector with 0..n-1 and then sorting it with std::sort according to the visitors.

inline Rcpp::IntegerVector OrderVisitors::apply() const {
    IntegerVector x = seq(0, nrows -1 ) ;
    std::sort( x.begin(), x.end(), OrderVisitors_Compare(*this) ) ;
    return x ;
}

The relevant part here is how the OrderVisitors_Compare class implements operator()(int,int):

inline bool operator()(int i, int j) const {
    if( i == j ) return false ;
    for( int k=0; k<n; k++)
        if( ! obj.visitors[k]->equal(i,j) )
            return obj.visitors[k]->before(i, j ) ; 
    return i < j ;
}
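
The same comparison logic can be tried outside of R with plain standard C++. The following is a minimal sketch, not dplyr's code: the function name order_by_two and the fixed pair of column types (int, then std::string) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Returns the 0-based row order for sorting by key1, then by key2.
// Both columns are assumed to have the same length.
std::vector<int> order_by_two(const std::vector<int>& key1,
                              const std::vector<std::string>& key2) {
    assert(key1.size() == key2.size());

    std::vector<int> index(key1.size());
    std::iota(index.begin(), index.end(), 0);  // 0, 1, ..., n-1

    std::sort(index.begin(), index.end(), [&](int i, int j) {
        if (i == j) return false;
        if (key1[i] != key1[j]) return key1[i] < key1[j];  // first key decides
        if (key2[i] != key2[j]) return key2[i] < key2[j];  // then the second key
        return i < j;  // rows tie on every key: keep their original order
    });
    return index;
}
```

The final `return i < j` matters: std::sort is not a stable sort, but breaking full ties by row index makes the resulting order deterministic and equivalent to a stable one.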

At this point, index gives us the integer indices of the sorted data, so we just have to make a new DataFrame by subsetting data with these indices. For this we have another kind of visitor, encapsulated in the DataFrameVisitors class. We first create a DataFrameVisitors object:

DataFrameVisitors visitors( data ) ;

This encapsulates a std::vector<VectorVisitor*>. Each VectorVisitor* knows how to subset itself with an integer vector index. This is used by DataFrameVisitors::subset:

template <typename Container>
DataFrame subset( const Container& index, const CharacterVector& classes ) const {
    List out(nvisitors);
    for( int k=0; k<nvisitors; k++){
       out[k] = get(k)->subset(index) ;    
    }
    structure( out, Rf_length(out[0]) , classes) ;
    return (SEXP)out ;
}

To wrap this up, here is a simple function using the tools developed in dplyr:

#include <dplyr.h>
// [[Rcpp::depends(dplyr)]]

using namespace Rcpp ;
using namespace dplyr ;

// [[Rcpp::export]]
DataFrame myFunc(DataFrame data, CharacterVector names) {
  OrderVisitors o(data, names ) ;
  IntegerVector index = o.apply() ;

  DataFrameVisitors visitors( data ) ;
  DataFrame res = visitors.subset(index, "data.frame" ) ;
  return res ;  
}
