Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices


Question

I wonder if anyone could have a look at the following code and minimal example and suggest improvements - in particular regarding efficiency of the code when working with really large data sets.

The function takes a data.frame and splits it by a grouping variable (factor) and then calculates the distance matrix for all the rows in each group.

I do not need to keep the distance matrices - only some statistics, i.e. the mean, the histogram, etc.; after that they can be discarded.
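Since only summary statistics are needed, the distance object for each group can be reduced to those statistics immediately and never kept alive. A minimal sketch of that idea (the function name `GroupDistStats` and the default `hist()` breakpoints are my own illustrative choices, not from the original post):

```r
GroupDistStats <- function(df) {
  # df: data frame whose first column is the grouping variable.
  # Each group's distance object d goes out of scope as soon as the
  # anonymous function returns, so at most one is alive at a time.
  lapply(split(df, df[[1]]), function(x) {
    d <- dist(x[, -1])
    list(n.pairs = length(d),
         mean    = mean(d),
         hist    = if (length(d)) hist(as.numeric(d), plot = FALSE)$counts
                   else integer(0))
  })
}
```

Groups with a single row produce zero pairwise distances, hence `mean` is `NaN` and the histogram is empty, mirroring the behavior noted in the example below.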

I don't know much about memory allocation and the like and am wondering what would be the best way to do this, since I will be working with 10,000-100,000 cases per group. Any thoughts will be greatly appreciated!

Also, what would be the least painful way of including bigmemory or some other large data handling package into the function as is in case I run into serious memory issues?
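For orientation, a `dist` object stores the lower triangle of the distance matrix as n(n-1)/2 doubles, so it is easy to estimate up front whether `dist()` will fit in RAM. A rough helper (my own illustration; it ignores R's small per-object overhead):

```r
# Approximate size of the dist() result for n rows:
# n * (n - 1) / 2 distances, 8 bytes per double
DistSizeGb <- function(n) n * (n - 1) / 2 * 8 / 1024^3

DistSizeGb(10000)    # ~0.37 Gb - fine
DistSizeGb(100000)   # ~37 Gb - dist() will fail to allocate
```

This is why the upper end of the stated range (100,000 cases per group) is exactly where the allocation errors quoted in the answer start to appear.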

FactorDistances <- function(df) {
  # df is a data frame whose first column is the grouping variable
  # (in the example it takes the values 2, 3 and 4)
  # split df by the grouping variable - each subset is one list element
  df.l <- split(df, df[[1]])
  # for each group, compute the distance matrix (Minkowski with p = 2,
  # i.e. Euclidean) and keep only its mean, so each matrix can be
  # garbage-collected as soon as the anonymous function returns
  means <- lapply(df.l, function(x) mean(dist(x[, -1], method = "minkowski", p = 2)))
  return(means)
}

df <- data.frame(factor = rep(2:4, 2:4), x = rnorm(9), y = rnorm(9))
FactorDistances(df)
# The result is three average Euclidean distances, one per group,
# over all pairs of rows in that group.
# If a group has only one member, the value is NaN.

I edited the title to reflect the chunking issue I posted as an answer.

Answer

I've come up with a chunking solution for those extra-large matrices that dist() can't handle, which I'm posting here in case anyone else finds it helpful (or finds fault with it, please!). It is significantly slower than dist(), but that is somewhat irrelevant, since it should only ever be used when dist() throws an error - usually one of the following:

"Error in double(N * (N - 1)/2) : vector size specified is too large" 
"Error: cannot allocate vector of size 6.0 Gb"
"Error: negative length vectors are not allowed"

The function calculates the mean distance for the matrix, but you can change that to anything else. If you want to actually save the matrix, I believe some sort of file-backed bigmemory matrix is in order. Kudos to http://stevemosher.wordpress.com/2012/04/12/nick-stokes-distance-code-now-with-big-memory/ for the idea, and to Ari for his help!
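For anyone who wants the file-backed variant mentioned above, here is a sketch of writing each chunk into an on-disk matrix instead of discarding it. It assumes the bigmemory package; `filebacked.big.matrix()` and its `backingfile`/`descriptorfile` arguments are from that package's API, but the function as a whole is my own untested adaptation of the chunking loop below, not code from the original post:

```r
library(bigmemory)

FunDistanceMatrixToDisk <- function(df, blockSize = 100,
                                    backingfile = "dist.bin",
                                    descriptorfile = "dist.desc") {
  n <- nrow(df)
  # n x n file-backed matrix; only one small chunk ever lives in RAM
  D <- filebacked.big.matrix(n, n, type = "double",
                             backingfile = backingfile,
                             descriptorfile = descriptorfile)
  blocks <- ceiling(n / blockSize)
  for (i in 1:blocks) {
    p <- ((i - 1) * blockSize + 1):min(i * blockSize, n)
    for (j in 1:i) {
      q <- ((j - 1) * blockSize + 1):min(j * blockSize, n)
      if (i == j) {
        D[p, p] <- as.matrix(dist(df[p, , drop = FALSE]))
      } else {
        # block of distances between rows q and rows p,
        # written into both triangles of the full matrix
        x <- as.matrix(dist(df[c(q, p), , drop = FALSE]))[length(q) + seq_along(p),
                                                          seq_along(q)]
        D[p, q] <- x
        D[q, p] <- t(x)
      }
    }
  }
  D
}
```

The backing file ends up holding the full n x n matrix (twice the triangle that dist() would store), which is the price of random access to both halves.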

FunDistanceMatrixChunking <- function (df, blockSize = 100) {
  n <- nrow(df)
  blocks <- ceiling(n / blockSize)
  # one row per block pair (i, j), j <= i:
  # number of distances in the chunk, and their mean
  chunk.means <- matrix(NA, nrow = blocks * (blocks + 1) / 2, ncol = 2)
  chunk <- 0
  for (i in 1:blocks) {
    p <- ((i - 1) * blockSize + 1):min(i * blockSize, n)
    for (j in 1:i) {
      chunk <- chunk + 1
      if (i == j) {
        # distances within block i
        x <- dist(df[p, , drop = FALSE])
      } else {
        # distances between block j (rows q) and block i (rows p):
        # run dist() on the stacked rows, keep only the between-block part
        q <- ((j - 1) * blockSize + 1):min(j * blockSize, n)
        x <- as.matrix(dist(df[c(q, p), , drop = FALSE]))[length(q) + seq_along(p),
                                                          seq_along(q)]
      }
      chunk.means[chunk, ] <- c(length(x), mean(x))
    }
  }
  # overall mean, weighted by the number of distances in each chunk
  weighted.mean(chunk.means[, 2], chunk.means[, 1])
}
# sanity check on a size that still fits in memory:
# the chunked mean should match dist()
df <- cbind(var1 = rnorm(1000), var2 = rnorm(1000))
mean(dist(df))
FunDistanceMatrixChunking(df, blockSize = 100)

Not sure whether I should have posted this as an edit instead of an answer. It does solve my problem, although I didn't really specify it this way.
