使用 big.matrix 对象计算欧几里得距离矩阵 [英] Calculate Euclidean distance matrix using a big.matrix object

查看:27
本文介绍了使用 big.matrix 对象计算欧几里得距离矩阵的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 R 中有一个 big.matrix 类的对象,维度为 778844 x 2.这些值都是整数(公里).我的目标是使用 big.matrix 计算欧几里得距离矩阵,并得到一个 big.matrix 类的对象.我想知道是否有最佳方法可以做到这一点.

I have an object of class big.matrix in R with dimension 778844 x 2. The values are all integers (kilometres). My objective is to calculate the Euclidean distance matrix using the big.matrix and have as a result an object of class big.matrix. I would like to know if there is an optimal way of doing that.

我选择使用 big.matrix 类的原因是内存限制.我可以将我的 big.matrix 转换为 matrix 类的对象,并使用 dist() 计算欧几里德距离矩阵.然而,dist() 将返回一个不会在内存中分配的大小的对象.

The reason for my choice of using the class big.matrix is memory limitation. I could transform my big.matrix to an object of class matrix and calculate the Euclidean distance matrix using dist(). However, dist() would return an object of size that would not be allocated in the memory.

编辑

以下答案由 bigmemory 包的作者和维护者 John W. Emerson 给出:

The following answer was given by John W. Emerson, author and maintainer of the bigmemory package:

我期望您可以使用大代数,但这也是通过 sourceCpp() 使用 Rcpp 的一个非常好的用例,而且非常简短和简单.但简而言之,我们甚至不尝试提供高级功能(除了我们作为概念验证实现的基础功能).一旦开始讨论内存不足问题,就没有一种算法可以涵盖所有用例.

You could use big algebra I expect, but this would also be a very nice use case for Rcpp via sourceCpp(), and very short and easy. But in short, we don't even attempt to provide high-level features (other than the basics which we implemented as proof-of-concept). No single algorithm could cover all use cases once you start talking out-of-memory big.

推荐答案

这是使用 RcppArmadillo 的方法.其中大部分与 RcppGallery 示例非常相似.这将返回一个 big.matrix 与相关的成对(按行)欧几里德距离.我喜欢将我的 big.matrix 函数包装在一个包装函数中以创建更清晰的语法(即避免 @address 和其他初始化.

Here is a way using RcppArmadillo. Much of this is very similar to the RcppGallery example. This will return a big.matrix with the associated pairwise (by row) euclidean distances. I like to wrap my big.matrix functions in a wrapper function to create a cleaner syntax (i.e. avoid the @address and other initializations.

注意 - 由于我们使用的是大内存(因此与 RAM 使用有关),我在此示例中返回了仅包含下三角元素的 N-1 x N-1 矩阵.你可以修改这个,但这是我拼凑起来的.

Note - as we are using bigmemory (and therefore concerned with RAM usage) I have this example returned the N-1 x N-1 matrix of only lower triangular elements. You could modify this but this is what I threw together.

euc_dist.cpp

euc_dist.cpp

// To enable the functionality provided by Armadillo's various macros,
// simply include them before you include the RcppArmadillo headers.
#define ARMA_NO_DEBUG

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo, BH, bigmemory)]]

using namespace Rcpp;
using namespace arma;

// The following header file provides the definitions for the BigMatrix
// object
#include <bigmemory/BigMatrix.h>

// C++11 plugin
// [[Rcpp::plugins(cpp11)]]

template <typename T>
void BigArmaEuclidean(const Mat<T>& inBigMat, Mat<T> outBigMat) {

  int W = inBigMat.n_rows;

  for(int i = 0; i < W - 1; i++){
      for(int j=i+1; j < W; j++){
          outBigMat(j-1,i) = sqrt(sum(pow((inBigMat.row(i) - inBigMat.row(j)),2)));
      }
  }
}

// [[Rcpp::export]]
void BigArmaEuc(SEXP pInBigMat, SEXP pOutBigMat) {
  // First we tell Rcpp that the object we've been given is an external
  // pointer.
  XPtr<BigMatrix> xpMat(pInBigMat);
  XPtr<BigMatrix> xpOutMat(pOutBigMat);


  int type = xpMat->matrix_type();
  switch(type) {
      case 1:
        BigArmaEuclidean(
            arma::Mat<char>((char *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
            arma::Mat<char>((char *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      case 2:
        BigArmaEuclidean(
          arma::Mat<short>((short *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
          arma::Mat<short>((short *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      case 4:
        BigArmaEuclidean(
          arma::Mat<int>((int *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
          arma::Mat<int>((int *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      case 8:
        BigArmaEuclidean(
          arma::Mat<double>((double *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
          arma::Mat<double>((double *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      default:
        // We should never get here, but it resolves compiler warnings.
        throw Rcpp::exception("Undefined type for provided big.matrix");
  }

}

我的小包装

bigMatrixEuc <- function(bigMat){
    zeros <- big.matrix(nrow = nrow(bigMat)-1,
                        ncol = nrow(bigMat)-1,
                        init = 0,
                        type = typeof(bigMat))
    BigArmaEuc(bigMat@address, zeros@address)
    return(zeros)
}

测试

library(Rcpp)
sourceCpp("euc_dist.cpp")

library(bigmemory)

set.seed(123)
mat <- matrix(rnorm(16), 4)
bm <- as.big.matrix(mat)

# Call new euclidean function
bm_out <- bigMatrixEuc(bm)[]

# pull out the matrix elements for out purposes
distMat <- as.matrix(dist(mat))
distMat[upper.tri(distMat, diag=TRUE)] <- 0
distMat <- distMat[2:4, 1:3]

# check if identical 
all.equal(bm_out, distMat, check.attributes = FALSE)
[1] TRUE

这篇关于使用 big.matrix 对象计算欧几里得距离矩阵的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆