Recalculating distance matrix
Question
I've got a large input matrix (4000x10000). I use dist() to calculate the Euclidean distance matrix for it (it takes about 5 hours).
I need to calculate the distance matrix for the "same" matrix with an additional row (for a 4001x10000 matrix). What is the fastest way to determine the distance matrix without recalculating the whole matrix?
Recommended answer
I'll assume your extra row means an extra point. If it means an extra variable/dimension, it will call for a different answer.
First of all, for the Euclidean distance between the rows of a matrix, I'd recommend the rdist function from the fields package. It is written in Fortran and is a lot faster than the dist function. It returns a matrix instead of a dist object, but you can always go from one to the other using as.matrix and as.dist.
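As a minimal sketch of that round trip, the snippet below uses only base R on a tiny toy matrix. The names pts, d.obj, d.mat and d.back are made up for illustration and are not part of the original answer.

```r
# Toy data: 5 points in 3 dimensions (illustrative only)
set.seed(42)
pts <- matrix(runif(5 * 3), nrow = 5)

d.obj  <- dist(pts)          # "dist" object: stores the lower triangle only
d.mat  <- as.matrix(d.obj)   # full symmetric 5 x 5 matrix with a zero diagonal
d.back <- as.dist(d.mat)     # back to a "dist" object

# The round trip should preserve the distances
all.equal(as.vector(d.obj), as.vector(d.back))
```

A full matrix is the easier form for binding on new rows and columns, while a dist object is what functions such as hclust expect.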
Here is some sample data (smaller than yours):
library(fields)  # provides rdist()

num.points <- 400
num.vars <- 1000
original.points <- matrix(runif(num.points * num.vars),
                          nrow = num.points, ncol = num.vars)
and the distance matrix you already computed:
d0 <- rdist(original.points)
For the extra point(s), you only need to compute the distances among the extra points and the distances between the extra points and the original points. I will use two extra points to show that the solution generalizes to any number of extra points:
extra.points <- matrix(runif(2 * num.vars), nrow = 2)
inner.dist <- rdist(extra.points)
outer.dist <- rdist(extra.points, original.points)
You can then bind these blocks into the bigger distance matrix:
d1 <- rbind(cbind(d0, t(outer.dist)),
            cbind(outer.dist, inner.dist))
Let's check that it matches what a full, long rerun would have produced:
d2 <- rdist(rbind(original.points, extra.points))
identical(d1, d2)
# [1] TRUE
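If the existing result is a dist object from base R and installing fields is not an option, the same idea can be sketched with base functions only. This is an illustrative variant under assumptions, not the answer's own code: x, d.old, new.row, new.d and m.new are hypothetical names, and the data is a small random stand-in. For one new row against n existing rows, only n new distances are computed rather than all n(n+1)/2.

```r
# Small stand-in for the real 4000 x 10000 matrix (illustrative only)
set.seed(1)
x <- matrix(runif(50 * 20), nrow = 50)
d.old <- dist(x)                      # the already-computed "dist" object

# One additional row (point)
new.row <- runif(20)

# Euclidean distance from the new row to every existing row.
# t(x) has one point per column, so new.row is recycled down each column.
new.d <- sqrt(colSums((t(x) - new.row)^2))

# Append the new distances as an extra column and row, with 0 on the diagonal
m <- as.matrix(d.old)
m.new <- rbind(cbind(m, new.d), c(new.d, 0))
d.new <- as.dist(m.new)

# Compare with a full recomputation (tolerant of floating-point rounding)
all.equal(as.vector(d.new), as.vector(dist(rbind(x, new.row))))
```

all.equal is used here instead of identical because the two code paths may differ in the last few bits of floating-point precision.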