Calculating all distances between one point and a group of points efficiently in R


Question

First of all, I am new to R (I started yesterday).

I have two groups of points, data and centers, the first of size n and the second of size K (for instance, n = 3823 and K = 10). For each point i in the first set, I need to find the point j in the second set that is at minimum distance from it.

My idea is simple: for each i, let dist[j] be the distance between i and j; then I only need which.min(dist) to find what I am looking for.

Each point is an array of 64 doubles, so

> dim(data)
[1] 3823   64
> dim(centers)
[1] 10 64

I tried

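n <- nrow(data)
K <- nrow(centers)
d <- numeric(K)  # distances from the current point to each center
S <- integer(n)  # index of the nearest center for each point
for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
  }
  S[i] <- which.min(d)
}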

which is extremely slow (with n = 200, it takes more than 40 s!). The fastest solution that I wrote is

distance <- function(point, group) {
  return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)])
}

for (i in 1:n) {
  d <- distance(data[i,], centers)
  which.min(d)
}

Even though it does a lot of computation that I don't use (because dist(m) computes the distances between all rows of m), it is much faster than the other one (can anyone explain why?). It is still not fast enough for what I need, because it will be called many times. Also, the distance code is very ugly. I tried to replace it with

distance <- function(point, group) {
  return (dist(rbind(point,group))[1:nrow(group)])
}

but this seems to be twice as slow. I also tried to use dist for each pair, but that is also slower.

I don't know what to do now. It seems like I am doing something very wrong. Any ideas on how to do this more efficiently?

PS: I need this to implement k-means by hand (I have to, it is part of an assignment). I believe I will only need Euclidean distance, but I am not yet sure, so I would prefer code where the distance computation can be replaced easily. stats::kmeans does the whole computation in less than one second.

Recommended answer

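Rather than iterating across data points, you can condense that step into a matrix operation, so that you only have to iterate across the K centers. Note that the points are stored as columns here, and that the result holds squared distances, which have the same minimum as the distances themselves.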

# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)

system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)
  })
)

This runs in:

utilisateur     système      écoulé 
      0.100       0.008       0.108 

on my laptop (the labels come from a French locale: user, system, elapsed).
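The answer stops at the matrix of squared distances, so here is a minimal sketch of how the nearest-center assignment might be finished while keeping the distance replaceable, as the question's PS asks. The names `sq_euclidean`, `manhattan` and `assign_nearest` are hypothetical helpers introduced for illustration; they are not part of the original answer. The sketch assumes the question's layout (one point per row) and that there is more than one point and more than one center.

```r
# Squared Euclidean distance from one center to every column of x (d x n).
# The square root is skipped because it does not change which center is nearest.
sq_euclidean <- function(x, center) colSums((x - center)^2)

# Hypothetical alternative, only to show that the distance can be swapped.
manhattan <- function(x, center) colSums(abs(x - center))

# data: n x d matrix, centers: K x d matrix (rows are points, as in the question).
# Assumes n > 1 and K > 1 so that apply() returns an n x K matrix.
assign_nearest <- function(data, centers, dist_fun = sq_euclidean) {
  x <- t(data)  # d x n, so that `x - center` recycles the center down each column
  dists <- apply(centers, 1, function(center) dist_fun(x, center))  # n x K
  apply(dists, 1, which.min)  # index of the nearest center for each point
}

# Fake data with the sizes from the question.
set.seed(42)
data <- matrix(rnorm(3823 * 64), nrow = 3823)
centers <- matrix(rnorm(10 * 64), nrow = 10)

S <- assign_nearest(data, centers)                           # Euclidean
S_l1 <- assign_nearest(data, centers, dist_fun = manhattan)  # swapped distance
```

Any function that takes a d x n matrix and a center of length d and returns n distances can be passed as `dist_fun`, which leaves the rest of a hand-written k-means loop unchanged.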

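On the question of why the original double loop is so slow: one possible explanation, which is an assumption because the question does not say how `data` was loaded, is that `data` and `centers` are data frames rather than matrices. Row indexing and arithmetic on data frames are much more expensive than on matrices. The sketch below uses made-up data and a hypothetical helper `nearest_loop` to compare the same loop on both representations.

```r
# Hypothetical stand-in for data read from a file: data frames with 64 numeric columns.
set.seed(1)
data_df <- as.data.frame(matrix(rnorm(200 * 64), nrow = 200))
centers_df <- as.data.frame(matrix(rnorm(10 * 64), nrow = 10))

# The double loop from the question, wrapped in a function so it can be timed.
nearest_loop <- function(data, centers) {
  S <- integer(nrow(data))
  d <- numeric(nrow(centers))
  for (i in seq_len(nrow(data))) {
    for (j in seq_len(nrow(centers))) {
      d[j] <- sqrt(sum((centers[j, ] - data[i, ])^2))
    }
    S[i] <- which.min(d)
  }
  S
}

# Same loop, first on data frames, then on the equivalent matrices.
t_df <- system.time(S_df <- nearest_loop(data_df, centers_df))
t_mat <- system.time(S_mat <- nearest_loop(as.matrix(data_df), as.matrix(centers_df)))
print(rbind(data_frame = t_df, matrix = t_mat))
```

If the timings differ widely on your machine, converting once with `as.matrix()` before any loop is a cheap first step, and the vectorized approach from the answer can then be applied on top of it.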