Finding duplicates using Rcpp


Question


I'm trying to find a speedier replacement for finding duplicates in R. The intent of the code is to pass the matrix to Rcpp with a row number from that matrix, then loop through the entire matrix looking for a match for that row. The matrix in question is a Logical matrix with 1000 rows and 250 cols.

Sounds simple, but the code below is not detecting equivalent vector rows. I'm not sure if it's an issue with the equal() function or something in how the matrix or vectors are defined.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins]]
#include <cstddef>   // std::size_t
#include <iterator>  // std::begin, std::end
#include <vector>    // std::vector
#include <iostream>
#include <string>

// [[Rcpp::export]]
bool dupCheckRcpp(int nVector, LogicalMatrix bigMatrix) {
    // initialize
    int i, j, nrow, ncol;
    nrow = bigMatrix.nrow();
    ncol = bigMatrix.ncol();
    LogicalVector vec(ncol);        // holds the row of interest
    LogicalVector vecMatrix(ncol);  // temp vector for the loop through bigMatrix
    nVector = nVector - 1;          // convert R's 1-based row number to 0-based

    // copy bigMatrix data into vec based on nVector row
    for (j = 0; j < ncol; ++j) {
        vec(j) = bigMatrix(nVector, j);
    }

    // check loop: check vec against each row in bigMatrix
    for (i = 0; i < nrow; ++i) {
        // copy bigMatrix data into vecMatrix
        for (j = 0; j < ncol; ++j) {
            vecMatrix(j) = bigMatrix(i, j);
        }
        // check for equality
        if (i != nVector) {  // skip the nVector row itself
            // compare vec to vecMatrix
            if (std::equal(vec.begin(), vec.end(), vecMatrix.begin())) {
                return true;
            }
        }
    }  // close check loop
    return false;
}

Solution

I'm not exactly sure where the mistake lies in your code, but note that you really shouldn't ever need to manually copy elements between Rcpp types like this:

// copy bigMatrix data into vec based on nVector row
for (j = 0; j < ncol; ++j) {
    vec(j) = bigMatrix(nVector, j);
}

There is almost always going to be a suitable class and / or appropriate assignment operator, etc. which allows you to accomplish this more succinctly and more safely (i.e. less prone to programming error). Here is a simpler implementation:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
bool is_duplicate_row(R_xlen_t r, LogicalMatrix x) {
    R_xlen_t i = 0, nr = x.nrow();
    const LogicalMatrix::Row& y = x.row(r);

    for (; i < r; i++) {
        if (is_true(all(y == x.row(i)))) {
            return true;
        }
    }
    for (i = r + 1; i < nr; i++) {
        if (is_true(all(y == x.row(i)))) {
            return true;
        }
    }

    return false;
}

In the spirit of my advice above,

  • const LogicalMatrix::Row& y = x.row(r); gives us a constant reference to the rth row of the matrix (r is 0-based here, unlike the 1-based nVector in the question)
  • x.row(i) refers to the ith row of x

Both of these expressions avoid element-wise copying via for loop, and are more readable IMO. Additionally, while there is certainly nothing wrong with using std::equal or any other standard algorithms, using Rcpp sugar expressions such as is_true(all(y == x.row(i))) can often simplify your code even further.
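To make the "no copying" point concrete outside of Rcpp, here is a minimal sketch in plain C++ that can be compiled on its own. It assumes the matrix is stored column-major in a flat std::vector<int>, which is how R lays out matrices in memory. FlatMatrix, rows_equal and is_duplicate_row_plain are hypothetical names made up for this illustration; none of them are part of Rcpp. A strided loop plays the role that std::equal or the sugar expression plays in the Rcpp version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a LogicalMatrix: column-major storage,
// so element (i, j) lives at data[i + j * nrow].
struct FlatMatrix {
    std::size_t nrow;
    std::size_t ncol;
    std::vector<int> data;
};

// Compare rows a and b element by element without copying either row.
bool rows_equal(const FlatMatrix& m, std::size_t a, std::size_t b) {
    assert(a < m.nrow && b < m.nrow);
    for (std::size_t j = 0; j < m.ncol; ++j) {
        if (m.data[a + j * m.nrow] != m.data[b + j * m.nrow]) {
            return false;  // first mismatch ends the comparison early
        }
    }
    return true;
}

// True if any row other than r (0-based) is identical to row r.
bool is_duplicate_row_plain(const FlatMatrix& m, std::size_t r) {
    for (std::size_t i = 0; i < m.nrow; ++i) {
        if (i != r && rows_equal(m, r, i)) {
            return true;
        }
    }
    return false;
}
```

Because the rows are read in place, the only per-comparison cost is the strided walk itself, and the early return means most non-matching rows are rejected after a few elements.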


set.seed(123)
m <- matrix(rbinom(1000 * 250, 1, 0.25) > 0, 1000)
m[600,] <- m[2,]

which(sapply(1:nrow(m) - 1, is_duplicate_row, m))
# [1]   2 600

c(which(duplicated(m, fromLast = TRUE)), which(duplicated(m)))
# [1]   2 600
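Calling is_duplicate_row once per row, as in the sapply call above, rescans the matrix for every row. If the goal is to flag every duplicated row at once, a single counting pass is one possible alternative. The following is only a sketch in plain C++ under the same column-major assumption; duplicated_rows is a hypothetical name, not an Rcpp or base R function, and unlike the version above it does copy each row once so the row can serve as a map key.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Flags every row that occurs more than once in a column-major matrix.
// The result corresponds to duplicated(m) | duplicated(m, fromLast = TRUE) in R.
std::vector<bool> duplicated_rows(const std::vector<int>& data,
                                  std::size_t nrow, std::size_t ncol) {
    assert(data.size() == nrow * ncol);

    // Gather each row once and count how often each distinct row occurs.
    std::vector<std::vector<int>> rows(nrow, std::vector<int>(ncol));
    std::map<std::vector<int>, std::size_t> counts;
    for (std::size_t i = 0; i < nrow; ++i) {
        for (std::size_t j = 0; j < ncol; ++j) {
            rows[i][j] = data[i + j * nrow];
        }
        ++counts[rows[i]];
    }

    // A row is flagged when its contents were seen more than once.
    std::vector<bool> flagged(nrow);
    for (std::size_t i = 0; i < nrow; ++i) {
        flagged[i] = counts[rows[i]] > 1;
    }
    return flagged;
}
```

Whether this beats the row-by-row scan in practice depends on the matrix shape and on how early mismatches occur, so it is worth benchmarking against duplicated() before adopting it.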
