hclust size limit?


Question

I'm new to R. I'm trying to run hclust() on about 50K items. I have 10 columns to compare and 50K rows of data. When I try to compute the distance matrix, I get: "Cannot allocate vector of 5GB".

Is there a size limit to this? If so, how do I go about clustering something this large?

EDIT

I ended up increasing the max.limit and the machine's memory to 8GB, and that seems to have fixed it.
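For reference (not part of the original post): on Windows builds of R, the allocation ceiling the asker refers to can be queried and raised with memory.limit(). Note that this function is Windows-only and was made defunct in R 4.2, where the limit is managed by the operating system instead.

```r
# Windows-only, and defunct as of R 4.2.
memory.limit()             # query the current limit in MB
memory.limit(size = 8000)  # raise the ceiling to roughly 8 GB
```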

Answer

Classic hierarchical clustering approaches are O(n^3) in runtime and O(n^2) in memory complexity. So yes, they scale incredibly badly to large data sets. Obviously, anything that requires materializing the distance matrix is in O(n^2) or worse.
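To make the O(n^2) memory cost concrete (a back-of-the-envelope check, not part of the original answer): R's dist() stores the lower triangle of the distance matrix as n(n-1)/2 double-precision values, so for n = 50,000 that is about 1.25 billion doubles, roughly 9.3 GiB, before any clustering starts.

```r
# Bytes needed just to hold the dist object for n points,
# assuming 8-byte doubles (ignoring R object overhead).
n <- 50000
bytes <- n * (n - 1) / 2 * 8
bytes / 2^30  # ~9.3 GiB for n = 50,000
```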

Note that there are some specializations of hierarchical clustering, such as SLINK and CLINK, that run in O(n^2) and, depending on the implementation, may only need O(n) memory.
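One concrete way to get this in R (my suggestion; the original answer does not name a package) is the fastcluster package, whose hclust.vector() computes single linkage directly from the data matrix without materializing the full distance matrix:

```r
# install.packages("fastcluster")  # setup step, if not already installed
library(fastcluster)

# 50K rows and 10 columns, matching the question's data shape.
X <- matrix(rnorm(50000 * 10), ncol = 10)

# SLINK-style single linkage straight from the data matrix;
# avoids allocating the n*(n-1)/2-element distance vector.
hc <- hclust.vector(X, method = "single", metric = "euclidean")

# Cut the dendrogram into, say, 20 flat clusters.
labels <- cutree(hc, k = 20)
```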

You might want to look into more modern clustering algorithms. Anything that runs in O(n log n) or better should work for you. There are plenty of good reasons not to use hierarchical clustering: it is usually rather sensitive to noise (i.e. it doesn't really know what to do with outliers), and the results are hard to interpret for large data sets (dendrograms are nice, but only for small data sets).
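As one illustration of a more scalable alternative (my example; the answer does not single out an algorithm), base R's kmeans() needs only O(n) extra memory and handles 50K x 10 comfortably:

```r
set.seed(42)
X <- matrix(rnorm(50000 * 10), ncol = 10)

# Hartigan-Wong k-means: roughly O(n * k) work per iteration, O(n) extra memory.
km <- kmeans(X, centers = 20, nstart = 5)
table(km$cluster)  # cluster sizes
```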
