How to calculate Euclidean distance (and save only summaries) for large data frames


Question





I've written a short 'for' loop to find the minimum Euclidean distance between each row in a data frame and all the other rows (and to record which row is closest). In theory this avoids the errors associated with trying to calculate distance measures for very large matrices. However, while not much is being saved in memory, it is very slow for large matrices (my use case of ~150K rows is still running).

I'm wondering whether anyone can advise or point me in the right direction in terms of vectorising my function, using apply or similar. Apologies for what may seem a simple question, but I'm still struggling to think in a vectorised way.

Thanks in advance (and for your patience).

require(proxy)

df <- data.frame(matrix(runif(10 * 10), nrow = 10, ncol = 10),
                 row.names = paste("site", 1:10))

min.dist <- function(df) {
  # data frame for results
  all.min.dist <- data.frame()
  # set up for loop
  for (k in 1:nrow(df)) {
    # calculate dissimilarity between each row and all other rows
    df.dist <- dist(df[k, ], df[-k, ])
    # find minimum distance
    min.dist <- min(df.dist)
    # get rowname for minimum distance (id of nearest point)
    closest.row <- row.names(df)[-k][which.min(df.dist)]
    # combine outputs
    all.min.dist <- rbind(all.min.dist,
                          data.frame(orig_row = row.names(df)[k],
                                     dist = min.dist,
                                     closest_row = closest.row))
  }
  # return results
  return(all.min.dist)
}

# example
min.dist(df)

Solution

This should be a good start. It uses fast matrix operations and avoids growing an object inside the loop, both of which were suggested in the comments.

min.dist <- function(df) {

  # `mat` is the transposed data matrix, one column per row of `df`.
  # Naming the argument `mat` rather than `df` matters: row.names(df)
  # must resolve, via lexical scoping, to the outer data frame, not to
  # the transposed matrix (whose row names are the original column names).
  which.closest <- function(k, mat) {
    # squared distances from point k to every other point
    d <- colSums((mat[, -k] - mat[, k]) ^ 2)
    m <- which.min(d)
    data.frame(orig_row    = row.names(df)[k],
               dist        = sqrt(d[m]),
               closest_row = row.names(df)[-k][m])
  }

  do.call(rbind, lapply(1:nrow(df), which.closest, t(as.matrix(df))))
}

If this is still too slow, as a suggested improvement, you could compute the distances for k points at a time instead of a single one. The size of k will need to be a compromise between speed and memory usage.
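To make the chunking idea concrete, here is an illustrative sketch in Python/NumPy (the language choice, the function name `nearest_rows_chunked`, and the variable names are mine, not from the answer): each pass takes `chunk` rows, computes their squared distances to every row in one broadcasted operation, masks out self-distances, and records the row-wise minimum.

```python
import numpy as np

def nearest_rows_chunked(x, chunk=2):
    """For each row of x (n rows, p columns), return the index of and
    distance to its nearest other row, processing `chunk` rows at a time."""
    n = x.shape[0]
    nearest = np.empty(n, dtype=int)
    dist = np.empty(n)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        block = x[start:stop]  # (b, p) slice of rows
        # squared distances from each block row to every row of x: (b, n)
        d2 = ((block[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
        # a row is never its own nearest neighbour
        d2[np.arange(stop - start), np.arange(start, stop)] = np.inf
        idx = d2.argmin(axis=1)
        nearest[start:stop] = idx
        dist[start:stop] = np.sqrt(d2[np.arange(stop - start), idx])
    return nearest, dist
```

With `chunk = n` this materialises the full pairwise matrix; with `chunk = 1` it degenerates to the original one-row-at-a-time loop, so `chunk` is exactly the speed/memory dial described above.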

Edit: Also read https://stackoverflow.com/a/16670220/1201032
