使用 big.matrix 对象计算欧几里得距离矩阵 [英] Calculate Euclidean distance matrix using a big.matrix object
问题描述
我在 R
中有一个 big.matrix
类的对象,维度为 778844 x 2
.这些值都是整数(公里).我的目标是使用 big.matrix
计算欧几里得距离矩阵,并得到一个 big.matrix
类的对象.我想知道是否有最佳方法可以做到这一点.
I have an object of class big.matrix
in R
with dimension 778844 x 2
. The values are all integers (kilometres). My objective is to calculate the Euclidean distance matrix using the big.matrix
and have as a result an object of class big.matrix
. I would like to know if there is an optimal way of doing that.
我选择使用 big.matrix
类的原因是内存限制.我可以将我的 big.matrix
转换为 matrix
类的对象,并使用 dist()
计算欧几里德距离矩阵.然而,dist()
将返回一个不会在内存中分配的大小的对象.
The reason for my choice of using the class big.matrix
is memory limitation. I could transform my big.matrix
to an object of class matrix
and calculate the Euclidean distance matrix using dist()
. However, dist()
would return an object of size that would not be allocated in the memory.
编辑
以下答案由 bigmemory
包的作者和维护者 John W. Emerson 给出:
The following answer was given by John W. Emerson, author and maintainer of the bigmemory
package:
我期望您可以使用大代数,但这也是通过 sourceCpp() 使用 Rcpp 的一个非常好的用例,而且非常简短和简单.但简而言之,我们甚至不尝试提供高级功能(除了我们作为概念验证实现的基础功能).一旦开始讨论内存不足问题,就没有一种算法可以涵盖所有用例.
You could use big algebra I expect, but this would also be a very nice use case for Rcpp via sourceCpp(), and very short and easy. But in short, we don't even attempt to provide high-level features (other than the basics which we implemented as proof-of-concept). No single algorithm could cover all use cases once you start talking out-of-memory big.
推荐答案
这是使用 RcppArmadillo
的方法.其中大部分与 RcppGallery 示例非常相似.这将返回一个 big.matrix
与相关的成对(按行)欧几里德距离.我喜欢将我的 big.matrix
函数包装在一个包装函数中以创建更清晰的语法(即避免 @address
和其他初始化.
Here is a way using RcppArmadillo
. Much of this is very similar to the RcppGallery example. This will return a big.matrix
with the associated pairwise (by row) euclidean distances. I like to wrap my big.matrix
functions in a wrapper function to create a cleaner syntax (i.e. avoid the @address
and other initializations.
注意 - 由于我们使用的是大内存(因此与 RAM 使用有关),我在此示例中返回了仅包含下三角元素的 N-1 x N-1 矩阵.你可以修改这个,但这是我拼凑起来的.
Note - as we are using bigmemory (and therefore concerned with RAM usage) I have this example returned the N-1 x N-1 matrix of only lower triangular elements. You could modify this but this is what I threw together.
euc_dist.cpp
euc_dist.cpp
// To enable the functionality provided by Armadillo's various macros,
// simply include them before you include the RcppArmadillo headers.
#define ARMA_NO_DEBUG
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo, BH, bigmemory)]]
using namespace Rcpp;
using namespace arma;
// The following header file provides the definitions for the BigMatrix
// object
#include <bigmemory/BigMatrix.h>
// C++11 plugin
// [[Rcpp::plugins(cpp11)]]
template <typename T>
void BigArmaEuclidean(const Mat<T>& inBigMat, Mat<T> outBigMat) {
int W = inBigMat.n_rows;
for(int i = 0; i < W - 1; i++){
for(int j=i+1; j < W; j++){
outBigMat(j-1,i) = sqrt(sum(pow((inBigMat.row(i) - inBigMat.row(j)),2)));
}
}
}
// [[Rcpp::export]]
void BigArmaEuc(SEXP pInBigMat, SEXP pOutBigMat) {
// First we tell Rcpp that the object we've been given is an external
// pointer.
XPtr<BigMatrix> xpMat(pInBigMat);
XPtr<BigMatrix> xpOutMat(pOutBigMat);
int type = xpMat->matrix_type();
switch(type) {
case 1:
BigArmaEuclidean(
arma::Mat<char>((char *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<char>((char *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 2:
BigArmaEuclidean(
arma::Mat<short>((short *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<short>((short *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 4:
BigArmaEuclidean(
arma::Mat<int>((int *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<int>((int *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
case 8:
BigArmaEuclidean(
arma::Mat<double>((double *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
arma::Mat<double>((double *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
);
return;
default:
// We should never get here, but it resolves compiler warnings.
throw Rcpp::exception("Undefined type for provided big.matrix");
}
}
我的小包装
bigMatrixEuc <- function(bigMat){
zeros <- big.matrix(nrow = nrow(bigMat)-1,
ncol = nrow(bigMat)-1,
init = 0,
type = typeof(bigMat))
BigArmaEuc(bigMat@address, zeros@address)
return(zeros)
}
测试
library(Rcpp)
sourceCpp("euc_dist.cpp")
library(bigmemory)
set.seed(123)
mat <- matrix(rnorm(16), 4)
bm <- as.big.matrix(mat)
# Call new euclidean function
bm_out <- bigMatrixEuc(bm)[]
# pull out the matrix elements for out purposes
distMat <- as.matrix(dist(mat))
distMat[upper.tri(distMat, diag=TRUE)] <- 0
distMat <- distMat[2:4, 1:3]
# check if identical
all.equal(bm_out, distMat, check.attributes = FALSE)
[1] TRUE
这篇关于使用 big.matrix 对象计算欧几里得距离矩阵的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!