Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices


Question

I wonder if anyone could have a look at the following code and minimal example and suggest improvements - in particular regarding efficiency of the code when working with really large data sets.

The function takes a data.frame and splits it by a grouping variable (factor) and then calculates the distance matrix for all the rows in each group.

I do not need to keep the distance matrices - only some statistics, i.e. the mean, the histogram, etc.; the matrices themselves can then be discarded.

I don't know much about memory allocation and the like and am wondering what would be the best way to do this, since I will be working with 10,000-100,000 cases per group. Any thoughts will be greatly appreciated!

Also, what would be the least painful way of including bigmemory or some other large-data-handling package into the function as it is, in case I run into serious memory issues?

FactorDistances <- function(df) {
  # df is the data frame where the first column is the grouping variable.
  # Find the names and number of groups in df (in the example there are three: 2, 3, 4).
  factor.names <- unique(df[1])
  n.factors <- length(unique(df$factor))
  # Split df by factor into a list - each subset data frame is one list element.
  df.l <- list()
  for (f in 1:n.factors) {
    df.l[[f]] <- df[which(df$factor == factor.names[f, ]), ]
  }
  # Use lapply to go through the list and calculate the distance matrix for each
  # group; this results in a new list where each element is a distance matrix.
  distances <- lapply(df.l, function(x) dist(x[, 2:length(x)], method = "minkowski", p = 2))
  # Again use lapply to get the mean distance for each group.
  means <- lapply(distances, mean)
  rm(distances)
  gc()
  return(means)
}

df <- data.frame(cbind(factor = rep(2:4, 2:4), rnorm(9), rnorm(9)))
FactorDistances(df)
# The results are three average Euclidean distances between all pairs in each group.
# If a group has only one member, the value is NaN.
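For reference, the same per-group means can be written more compactly with base R's split(), which avoids the manual indexing loop (a sketch; the column names x and y are illustrative):

```r
set.seed(1)
df <- data.frame(factor = rep(2:4, 2:4), x = rnorm(9), y = rnorm(9))
# split() partitions the rows by the grouping column; dist() is then applied
# to the remaining (numeric) columns of each subset.
group.means <- sapply(split(df[, -1], df$factor), function(g) mean(dist(g)))
group.means  # named vector with one mean distance per group ("2", "3", "4")
```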

I edited the title to reflect the chunking issue I posted as an answer..

Answer

I've come up with a chunking solution for those extra large matrices that dist() can't handle, which I'm posting here in case anyone else finds it helpful (or finds fault with it, please!). It is significantly slower than dist(), but that is kind of irrelevant, since it should only ever be used when dist() throws an error - usually one of the following:

"Error in double(N * (N - 1)/2) : vector size specified is too large" 
"Error: cannot allocate vector of size 6.0 Gb"
"Error: negative length vectors are not allowed"

The function calculates the mean distance for the matrix, but you can change that to anything else; in case you want to actually save the matrix, I believe some sort of file-backed bigmemory matrix is in order. Kudos to link for the idea and to Ari for his help!
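As an example of swapping in a different statistic: a histogram can be accumulated chunk by chunk, because with fixed, shared breaks the per-chunk bin counts simply add up. A minimal sketch with made-up stand-in data (the breaks and chunk sizes are illustrative):

```r
set.seed(7)
breaks <- seq(0, 10, by = 0.5)             # shared bin edges for every chunk
total.counts <- numeric(length(breaks) - 1)
# Stand-ins for the distance vectors produced by individual chunks:
chunks <- list(runif(50, 0, 10), runif(80, 0, 10))
for (x in chunks) {
  # hist() with plot = FALSE just returns the counts; summing them over
  # chunks gives the histogram of the full distance matrix.
  total.counts <- total.counts + hist(x, breaks = breaks, plot = FALSE)$counts
}
sum(total.counts)  # 130 - every distance lands in exactly one bin
```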

FunDistanceMatrixChunking <- function(df, blockSize = 100) {
  n <- nrow(df)
  # Number of row blocks (rounded up if n is not a multiple of blockSize).
  blocks <- n %/% blockSize
  if ((n %% blockSize) > 0) blocks <- blocks + 1
  # One row per chunk on or below the diagonal: (count, mean) of its distances.
  chunk.means <- matrix(NA, nrow = blocks * (blocks + 1) / 2, ncol = 2)
  dex <- 1:blockSize
  chunk <- 0
  for (i in 1:blocks) {
    # Row indices of block i, trimmed to n for the (possibly shorter) last block.
    p <- dex + (i - 1) * blockSize
    lex <- (blockSize + 1):(2 * blockSize)
    lex <- lex[p <= n]
    p <- p[p <= n]
    for (j in 1:blocks) {
      q <- dex + (j - 1) * blockSize
      q <- q[q <= n]
      if (i == j) {
        # Diagonal chunk: an ordinary dist() on block i alone.
        chunk <- chunk + 1
        x <- dist(df[p, ])
        chunk.means[chunk, ] <- c(length(x), mean(x))
      }
      if (i > j) {
        # Off-diagonal chunk: distances between blocks i and j, extracted from
        # the dist() of the two blocks stacked together. Chunks with i < j are
        # skipped because the distance matrix is symmetric.
        chunk <- chunk + 1
        x <- as.matrix(dist(df[c(q, p), ]))[lex, dex]
        chunk.means[chunk, ] <- c(length(x), mean(x))
      }
    }
  }
  # Combine the per-chunk means, weighted by the number of distances per chunk.
  mean <- weighted.mean(chunk.means[, 2], chunk.means[, 1])
  return(mean)
}
df <- cbind(var1=rnorm(1000), var2=rnorm(1000))
mean(dist(df))
FunDistanceMatrixChunking(df, blockSize=100)

不确定我是否应该将此作为编辑而不是答案发布..它确实解决了我的问题,尽管我并没有真正以这种方式指定它..

Not sure whether I should have posted this as an edit, instead of an answer.. It does solve my problem, although I didn't really specify it this way..
