使用big.matrix对象计算欧几里得距离矩阵 [英] Calculate Euclidean distance matrix using a big.matrix object

查看:153
本文介绍了使用big.matrix对象计算欧几里得距离矩阵的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在R中有一个尺寸为778844 x 2big.matrix类对象.值都是整数(千米).我的目标是使用big.matrix计算欧几里得距离矩阵,并因此得到类big.matrix的对象.我想知道是否有最佳的方法.

I have an object of class big.matrix in R with dimension 778844 x 2. The values are all integers (kilometres). My objective is to calculate the Euclidean distance matrix using the big.matrix and have as a result an object of class big.matrix. I would like to know if there is an optimal way of doing that.

我选择使用类big.matrix的原因是内存限制.我可以将我的big.matrix转换为类matrix的对象,并使用dist()计算欧几里得距离矩阵.但是,dist()会返回一个对象,该对象的大小不会在内存中分配.

The reason for my choice of using the class big.matrix is memory limitation. I could transform my big.matrix to an object of class matrix and calculate the Euclidean distance matrix using dist(). However, dist() would return an object of size that would not be allocated in the memory.

修改

以下答案由bigmemory软件包的作者和维护者John W. Emerson给出:

The following answer was given by John W. Emerson, author and maintainer of the bigmemory package:

您可以使用我期望的大代数,但这对于通过sourceCpp()的Rcpp来说是一个非常好的用例,而且非常简短易用.简而言之,我们甚至不尝试提供高级功能(除了我们作为概念验证实现的基础之外).一旦您开始谈论内存不足时,没有任何一种算法可以涵盖所有用例.

You could use big algebra I expect, but this would also be a very nice use case for Rcpp via sourceCpp(), and very short and easy. But in short, we don't even attempt to provide high-level features (other than the basics which we implemented as proof-of-concept). No single algorithm could cover all use cases once you start talking out-of-memory big.

推荐答案

以下是使用RcppArmadillo的一种方法.其中大部分与 RcppGallery示例非常相似.这将返回一个big.matrix及其相关的成对(按行)欧氏距离.我想将big.matrix函数包装在包装函数中以创建更简洁的语法(即避免@address和其他初始化.

Here is a way using RcppArmadillo. Much of this is very similar to the RcppGallery example. This will return a big.matrix with the associated pairwise (by row) euclidean distances. I like to wrap my big.matrix functions in a wrapper function to create a cleaner syntax (i.e. avoid the @address and other initializations.

注意-由于我们正在使用bigmemory(因此与RAM使用有关),我在此示例中返回了仅包含下三角元素的N-1 x N-1矩阵.您可以修改它,但这就是我的想法.

Note - as we are using bigmemory (and therefore concerned with RAM usage) I have this example returned the N-1 x N-1 matrix of only lower triangular elements. You could modify this but this is what I threw together.

euc_dist.cpp

euc_dist.cpp

// To enable the functionality provided by Armadillo's various macros,
// simply include them before you include the RcppArmadillo headers.
#define ARMA_NO_DEBUG

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo, BH, bigmemory)]]

using namespace Rcpp;
using namespace arma;

// The following header file provides the definitions for the BigMatrix
// object
#include <bigmemory/BigMatrix.h>

// C++11 plugin
// [[Rcpp::plugins(cpp11)]]

template <typename T>
void BigArmaEuclidean(const Mat<T>& inBigMat, Mat<T> outBigMat) {

  int W = inBigMat.n_rows;

  for(int i = 0; i < W - 1; i++){
      for(int j=i+1; j < W; j++){
          outBigMat(j-1,i) = sqrt(sum(pow((inBigMat.row(i) - inBigMat.row(j)),2)));
      }
  }
}

// [[Rcpp::export]]
void BigArmaEuc(SEXP pInBigMat, SEXP pOutBigMat) {
  // First we tell Rcpp that the object we've been given is an external
  // pointer.
  XPtr<BigMatrix> xpMat(pInBigMat);
  XPtr<BigMatrix> xpOutMat(pOutBigMat);


  int type = xpMat->matrix_type();
  switch(type) {
      case 1:
        BigArmaEuclidean(
            arma::Mat<char>((char *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
            arma::Mat<char>((char *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      case 2:
        BigArmaEuclidean(
          arma::Mat<short>((short *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
          arma::Mat<short>((short *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      case 4:
        BigArmaEuclidean(
          arma::Mat<int>((int *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
          arma::Mat<int>((int *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      case 8:
        BigArmaEuclidean(
          arma::Mat<double>((double *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
          arma::Mat<double>((double *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      default:
        // We should never get here, but it resolves compiler warnings.
        throw Rcpp::exception("Undefined type for provided big.matrix");
  }

}

我的小包装纸

bigMatrixEuc <- function(bigMat){
    zeros <- big.matrix(nrow = nrow(bigMat)-1,
                        ncol = nrow(bigMat)-1,
                        init = 0,
                        type = typeof(bigMat))
    BigArmaEuc(bigMat@address, zeros@address)
    return(zeros)
}

测试

library(Rcpp)
sourceCpp("euc_dist.cpp")

library(bigmemory)

set.seed(123)
mat <- matrix(rnorm(16), 4)
bm <- as.big.matrix(mat)

# Call new euclidean function
bm_out <- bigMatrixEuc(bm)[]

# pull out the matrix elements for out purposes
distMat <- as.matrix(dist(mat))
distMat[upper.tri(distMat, diag=TRUE)] <- 0
distMat <- distMat[2:4, 1:3]

# check if identical 
all.equal(bm_out, distMat, check.attributes = FALSE)
[1] TRUE

这篇关于使用big.matrix对象计算欧几里得距离矩阵的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆