Calculating all distances between one point and a group of points efficiently in R
Problem description
First of all, I am new to R (I started yesterday).
I have two groups of points, data and centers, the first of size n and the second of size K (for instance, n = 3823 and K = 10). For each point i in the first set, I need to find the point j in the second set at the minimum distance.
My idea is simple: for each i, let dist[j] be the distance between i and j. Then I only need which.min(dist) to find what I am looking for.
Each point is an array of 64 doubles, so
> dim(data)
[1] 3823 64
> dim(centers)
[1] 10 64
I tried
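d <- numeric(K)   # distances from point i to each center
S <- integer(n)   # index of the nearest center for each point
for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
  }
  S[i] <- which.min(d)
}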
which is extremely slow (with n = 200, it takes more than 40 s!). The fastest solution that I wrote is
distance <- function(point, group) {
  return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)])
}
for (i in 1:n) {
  d <- distance(data[i,], centers)
  which.min(d)
}
Even though it does a lot of computation that I don't use (because dist(m) computes the distances between all rows of m), it is much faster than the other one (can anyone explain why?). It is still not fast enough for what I need, because it will not be used only once. Also, the distance code is very ugly. I tried to replace it with
distance <- function(point, group) {
  return (dist(rbind(point,group))[1:nrow(group)])
}
but this seems to be twice as slow. I also tried calling dist on each pair, but that is slower too.
I don't know what to do now. It seems like I am doing something very wrong. Any ideas on how to do this more efficiently?
PS: I need this to implement k-means by hand (I have to, it is part of an assignment). I believe I will only need Euclidean distance, but I am not yet sure, so I would prefer code in which the distance computation can be replaced easily. stats::kmeans does the whole computation in less than one second.
Recommended answer
Rather than iterating across data points, you can condense that into a matrix operation, meaning you only have to iterate across K.
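# Generate some fake data.
# Points are stored as columns: x is d x n and centers is d x K.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)

# For each center, compute the squared Euclidean distance to every point.
# dists is an n x K matrix.
system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)
  })
)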
This runs in:
   user  system elapsed
  0.100   0.008   0.108
on my laptop.
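The matrix of distances still has to be turned into the per-point assignment the question asks for, and the question also wants the distance to be easy to swap. Below is a minimal sketch of one way to do both, assuming the question's layout (one point per row, so data is n x 64 and centers is K x 64). The names nearest_center, sq_euclidean and manhattan are hypothetical helpers introduced here for illustration, not part of the original answer. The square root is skipped because it does not change which center is nearest.

```r
# Hypothetical helpers (illustrative names, not from the original answer).
# Each distance function takes x (d x n, one point per column) and a single
# center (length d) and returns a vector of n distances.
sq_euclidean <- function(x, center) colSums((x - center)^2)
manhattan    <- function(x, center) colSums(abs(x - center))

# Index of the nearest center for each row of `data`.
# Assumes `data` is n x d and `centers` is K x d, as in the question.
nearest_center <- function(data, centers, dist_fun = sq_euclidean) {
  x <- t(as.matrix(data))                 # d x n, so a center recycles down each column
  dists <- apply(as.matrix(centers), 1, function(center) dist_fun(x, center))
  dists <- matrix(dists, nrow = ncol(x))  # keep the n x K shape even when n == 1
  apply(dists, 1, which.min)              # nearest center per point
}

# Usage on fake data shaped like the question's.
set.seed(1)
data    <- matrix(rnorm(200 * 64), nrow = 200)
centers <- matrix(rnorm(10 * 64), nrow = 10)
S <- nearest_center(data, centers)
```

Passing a different dist_fun, for example nearest_center(data, centers, manhattan), changes the metric without touching the rest of the code. The only explicit R-level iteration is over the K centers, as in the answer above.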