使用big.matrix对象计算欧几里得距离矩阵 [英] Calculate Euclidean distance matrix using a big.matrix object
问题描述
我在R
中有一个尺寸为778844 x 2
的big.matrix
类对象.值都是整数(千米).我的目标是使用big.matrix
计算欧几里得距离矩阵,并因此得到类big.matrix
的对象.我想知道是否有最佳的方法.
I have an object of class big.matrix
in R
with dimension 778844 x 2
. The values are all integers (kilometres). My objective is to calculate the Euclidean distance matrix using the big.matrix
and have as a result an object of class big.matrix
. I would like to know if there is an optimal way of doing that.
我选择使用类big.matrix
的原因是内存限制.我可以将我的big.matrix
转换为类matrix
的对象,并使用dist()
计算欧几里得距离矩阵.但是,dist()
会返回一个对象,该对象的大小不会在内存中分配.
The reason for my choice of using the class big.matrix
is memory limitation. I could transform my big.matrix
to an object of class matrix
and calculate the Euclidean distance matrix using dist()
. However, dist()
would return an object of size that would not be allocated in the memory.
修改
以下答案由bigmemory
软件包的作者和维护者John W. Emerson给出:
The following answer was given by John W. Emerson, author and maintainer of the bigmemory
package:
您可以使用我期望的大代数,但这对于通过sourceCpp()的Rcpp来说是一个非常好的用例,而且非常简短易用.简而言之,我们甚至不尝试提供高级功能(除了我们作为概念验证实现的基础之外).一旦您开始谈论内存不足时,没有任何一种算法可以涵盖所有用例.
You could use big algebra I expect, but this would also be a very nice use case for Rcpp via sourceCpp(), and very short and easy. But in short, we don't even attempt to provide high-level features (other than the basics which we implemented as proof-of-concept). No single algorithm could cover all use cases once you start talking out-of-memory big.
推荐答案
以下是使用RcppArmadillo
的一种方法.其中大部分与 RcppGallery示例非常相似.这将返回一个big.matrix
及其相关的成对(按行)欧氏距离.我想将big.matrix
函数包装在包装函数中以创建更简洁的语法(即避免@address
和其他初始化.
Here is a way using RcppArmadillo
. Much of this is very similar to the RcppGallery example. This will return a big.matrix
with the associated pairwise (by row) euclidean distances. I like to wrap my big.matrix
functions in a wrapper function to create a cleaner syntax (i.e. avoid the @address
and other initializations.
注意-由于我们正在使用bigmemory(因此与RAM使用有关),我在此示例中返回了仅包含下三角元素的N-1 x N-1矩阵.您可以修改它,但这就是我的想法.
Note - as we are using bigmemory (and therefore concerned with RAM usage) I have this example returned the N-1 x N-1 matrix of only lower triangular elements. You could modify this but this is what I threw together.
euc_dist.cpp
euc_dist.cpp
// To enable the functionality provided by Armadillo's various macros,
// simply include them before you include the RcppArmadillo headers.
#define ARMA_NO_DEBUG
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo, BH, bigmemory)]]
using namespace Rcpp;
using namespace arma;
// The following header file provides the definitions for the BigMatrix
// object
#include <bigmemory/BigMatrix.h>
// C++11 plugin
// [[Rcpp::plugins(cpp11)]]
template <typename T>
void BigArmaEuclidean(const Mat<T>& inBigMat, Mat<T> outBigMat) {
int W = inBigMat.n_rows;
for(int i = 0; i < W - 1; i++){
for(int j=i+1; j < W; j++){
outBigMat(j-1,i) = sqrt(sum(pow((inBigMat.row(i) - inBigMat.row(j)),2)));
}
}
}
// [[Rcpp::export]]
void BigArmaEuc(SEXP pInBigMat, SEXP pOutBigMat) {
// First we tell Rcpp that the object we've been given is an external
// pointer.
XPtr<BigMatrix> xpMat(pInBigMat);
XPtr<BigMatrix> xpOutMat(pOutBigMat);
int type = xpMat->matrix_type();
switch(type) {
case 1:
BigArmaEuclidean(
arma::Mat<char>((char *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<char>((char *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 2:
BigArmaEuclidean(
arma::Mat<short>((short *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<short>((short *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 4:
BigArmaEuclidean(
arma::Mat<int>((int *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<int>((int *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 8:
BigArmaEuclidean(
arma::Mat<double>((double *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<double>((double *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
default:
// We should never get here, but it resolves compiler warnings.
throw Rcpp::exception("Undefined type for provided big.matrix");
}
}
我的小包装纸
bigMatrixEuc <- function(bigMat){
zeros <- big.matrix(nrow = nrow(bigMat)-1,
ncol = nrow(bigMat)-1,
init = 0,
type = typeof(bigMat))
BigArmaEuc(bigMat@address, zeros@address)
return(zeros)
}
测试
library(Rcpp)
sourceCpp("euc_dist.cpp")
library(bigmemory)
set.seed(123)
mat <- matrix(rnorm(16), 4)
bm <- as.big.matrix(mat)
# Call new euclidean function
bm_out <- bigMatrixEuc(bm)[]
# pull out the matrix elements for out purposes
distMat <- as.matrix(dist(mat))
distMat[upper.tri(distMat, diag=TRUE)] <- 0
distMat <- distMat[2:4, 1:3]
# check if identical
all.equal(bm_out, distMat, check.attributes = FALSE)
[1] TRUE
这篇关于使用big.matrix对象计算欧几里得距离矩阵的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!