Reducing NbClust memory usage

Question

I need some help with massive usage of memory by the NbClust function. On my data, memory balloons to 56GB at which point R crashes with a fatal error. Using debug(), I was able to trace the error to these lines:

            if (any(indice == 23) || (indice == 32)) {
                res[nc - min_nc + 1, 23] <- Index.sPlussMoins(cl1 = cl1, 
                    md = md)$gamma

Debugging of Index.sPlussMoins revealed that the crash happens during a for loop. The iteration that it crashes at varies, and during the loop memory usage varies between 41 and 57Gb (I have 64 total):

    for (k in 1:nwithin1) {
      s.plus <- s.plus + (colSums(outer(between.dist1, 
                                        within.dist1[k], ">")))
      s.moins <- s.moins + (colSums(outer(between.dist1, 
                                          within.dist1[k], "<")))
      print(s.moins)
    }

I'm guessing that the memory usage comes from the outer() function. Can I modify NbClust to be more memory efficient (perhaps using the bigmemory package)? At the very least, it would be nice to get R to exit the function with a "cannot allocate vector of size..." error instead of crashing. That way I would have an idea of just how much more memory I need to handle the matrix causing the crash.
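For instance, here is a rough sketch (my own, not NbClust's code) of an equivalent of that loop that never materialises the outer() matrices, assuming between.dist1 and within.dist1 are ordinary numeric vectors of pairwise distances and nwithin1 is length(within.dist1):

    # Sketch: for every within-cluster distance w, count how many between-cluster
    # distances are strictly greater (s.plus) or strictly smaller (s.moins) than w,
    # without allocating the temporary vectors that outer() creates.
    b_sorted <- sort(between.dist1)
    nb <- length(b_sorted)
    n_le <- findInterval(within.dist1, b_sorted)                    # #{b <= w} for each w
    n_lt <- findInterval(within.dist1, b_sorted, left.open = TRUE)  # #{b <  w} for each w
    s.plus  <- sum(nb - n_le)   # total pairs with between > within
    s.moins <- sum(n_lt)        # total pairs with between < within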

I created a minimal example with a matrix the approximate size of the one I am using, although now it crashes at a different point, when the hclust function is called:

set.seed(123)

# ten cluster means, with a 60000 x 60 block of data generated around each
cluster_means = sample(1:25, 10)
mlist = list()
for(cm in cluster_means){
  name = as.character(cm)
  m = data.frame(matrix(rnorm(60000*60, mean=cm, sd=runif(1, 0.5, 3.5)), 60000, 60))
  mlist[[name]] = m
}

# combine the blocks column-wise into one 60000 x 600 data frame
test_data = do.call(cbind, cbind(mlist))

library(NbClust)
debug(fun = "NbClust")
nbc = NbClust(data = test_data, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 30, 
              method = "ward.D2", index = "alllong", alphaBeale = 0.1)

debug: hc <- hclust(md, method = "ward.D2")

It seems to crash before using up the available memory (according to my system monitor, 34Gb is being used when it crashes, out of 64 total).

So is there any way I can do this without sub-sampling down to manageably sized matrices? And if I did, how would I know how much memory I need for a matrix of a given size? I would have thought my 64Gb would be enough.
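As a rough back-of-the-envelope estimate (assuming, on my part, that the dominant allocation is the dist object holding all pairwise row distances), the 60000-row example above gives:

    # one copy of the lower-triangular distance vector: n*(n-1)/2 doubles
    n <- 60000                               # rows being clustered in the example
    dist_gb <- n * (n - 1) / 2 * 8 / 1024^3  # bytes -> GiB
    dist_gb                                  # ~13.4 GiB per copy; hclust and the index
                                             # code appear to need several working
                                             # copies on top of that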

I tried altering NbClust to use fastcluster instead of the stats version. It didn't crash, but did exit with a memory error:

Browse[2]> 
exiting from: fastcluster::hclust(md, method = "ward.D2")
Error: cannot allocate vector of size 9.3 Gb
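For reference, fastcluster::hclust takes the same dist object and method argument as the stats version, so the substituted call looks roughly like this (a sketch assuming a plain Euclidean dist on the test matrix, which is what NbClust builds internally for distance = "euclidean"):

    library(fastcluster)
    md <- dist(test_data, method = "euclidean")        # ~13 GiB for 60000 rows
    hc <- fastcluster::hclust(md, method = "ward.D2")  # drop-in for stats::hclust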

Answer

If you check the source code of NbClust, you'll see that it is anything but optimized for speed or memory efficiency.

The crash you're reporting is not even during clustering - it's in the evaluation afterwards, specifically in the "Gamma, Gplus and Tau" index code. Disable these indices and you may get further, but most likely you'll just have the same problem again in another index. Maybe you can pick only a few indices to run, specifically indices that do not need a lot of memory?
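For example, something along these lines (a sketch of that suggestion; which indices are cheap enough for your data is an assumption you would have to check), running NbClust once per index instead of with index = "alllong" and skipping the Gamma/Gplus/Tau family entirely:

    library(NbClust)
    cheap_indices <- c("kl", "ch", "silhouette", "db")   # illustrative subset
    results <- lapply(cheap_indices, function(idx) {
      NbClust(data = test_data, diss = NULL, distance = "euclidean",
              min.nc = 2, max.nc = 30, method = "ward.D2", index = idx)
    })
    names(results) <- cheap_indices
    sapply(results, function(r) r$Best.nc)   # best k suggested by each index

Each call still builds the full distance matrix and dendrogram, so this only avoids the extra blow-up from the concordance-based indices.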
