How to calculate Euclidean distance (and save only summaries) for large data frames
Problem description
I've written a short 'for' loop to find the minimum euclidean distance between each row in a dataframe and all the other rows (and to record which row is closest). In theory this avoids the errors associated with trying to calculate distance measures for very large matrices. However, while not that much is being saved in memory, it is very very slow for large matrices (my use case of ~150K rows is still running).
I'm wondering whether anyone can advise or point me in the right direction in terms of vectorising my function, using apply or similar. Apologies for what may seem a simple question, but I'm still struggling to think in a vectorised way.
Thanks in advance (and for your patience).
require(proxy)
df <- data.frame(matrix(runif(10 * 10), nrow = 10, ncol = 10),
                 row.names = paste("site", 1:10))
min.dist <- function(df) {
  # data frame for results
  all.min.dist <- data.frame()
  # set up for loop
  for (k in 1:nrow(df)) {
    # calculate dissimilarity between each row and all other rows
    df.dist <- dist(df[k, ], df[-k, ])
    # find minimum distance
    min.dist <- min(df.dist)
    # get rowname for minimum distance (id of nearest point)
    closest.row <- row.names(df)[-k][which.min(df.dist)]
    # combine outputs
    all.min.dist <- rbind(all.min.dist,
                          data.frame(orig_row = row.names(df)[k],
                                     dist = min.dist,
                                     closest_row = closest.row))
  }
  # return results
  return(all.min.dist)
}
#example
min.dist(df)
This should be a good start. It uses fast matrix operations and avoids growing the result object inside the loop, both suggested in the comments.
min.dist <- function(df) {
  which.closest <- function(k, mat) {
    # squared Euclidean distances from point k to all other points
    # (mat holds the points as columns, hence the transpose below)
    d <- colSums((mat[, -k] - mat[, k])^2)
    m <- which.min(d)
    data.frame(orig_row = row.names(df)[k],
               dist = sqrt(d[m]),
               closest_row = row.names(df)[-k][m])
  }
  do.call(rbind, lapply(1:nrow(df), which.closest, t(as.matrix(df))))
}
If this is still too slow, as a suggested improvement, you could compute the distances for k points at a time instead of a single one. The size of k will need to be a compromise between speed and memory usage.
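That blocked strategy could be sketched as follows. This is an illustrative implementation, not part of the original answer: the function name `min.dist.blocked` and the `block` argument are assumptions, and it relies on the identity |x − y|² = |x|² + |y|² − 2·x·y so that each iteration only materializes a block × n distance matrix instead of the full n × n one.

```r
# A sketch of the blocked approach: process `block` points per iteration so
# the largest intermediate object is a block x n matrix rather than n x n.
min.dist.blocked <- function(df, block = 1000) {
  m  <- as.matrix(df)
  n  <- nrow(m)
  sq <- rowSums(m^2)                  # squared norms, reused for every block
  starts <- seq(1, n, by = block)
  chunks <- lapply(starts, function(s) {
    idx <- s:min(s + block - 1, n)
    # squared distances between the block's points and all points:
    # |x - y|^2 = |x|^2 + |y|^2 - 2 * x.y
    d2 <- outer(sq[idx], sq, "+") - 2 * m[idx, , drop = FALSE] %*% t(m)
    d2[cbind(seq_along(idx), idx)] <- Inf     # exclude self-distances
    j <- max.col(-d2, ties.method = "first")  # per-row index of the minimum
    data.frame(orig_row    = row.names(df)[idx],
               dist        = sqrt(pmax(d2[cbind(seq_along(idx), j)], 0)),
               closest_row = row.names(df)[j])
  })
  do.call(rbind, chunks)
}
```

The `pmax(..., 0)` guards against tiny negative values that floating-point cancellation can produce in the expansion; larger `block` values trade memory for fewer iterations.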
Edit: Also read https://stackoverflow.com/a/16670220/1201032