R, issue with a Hierarchical clustering after a Multiple correspondence analysis
Question
I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components. My vectors are composed of one email and 30 qualitative variables. Each qualitative variable has 4 classes: 0, 1, 2 and 3.
So the first thing I'm doing is loading the library FactoMineR and loading my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running an MCA on this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the HCPC function:
res.hcpc <- HCPC(mca.res)
But I get the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?
Answer
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~180 billion elements). You simply don't have the RAM to store this object, and even if you did, the computation would likely take hours if not days to complete.
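A quick back-of-the-envelope check in plain R shows why the allocation fails; the exact figure in the error message depends on how the object is stored, but the order of magnitude matches:

```r
# Number of observations to cluster
n <- 600000

# A dist object stores the lower triangle: n*(n-1)/2 values,
# each an 8-byte double
elements <- n * (n - 1) / 2
bytes    <- elements * 8

# Convert to GiB (2^30 bytes)
gib <- bytes / 2^30
round(gib)  # over a thousand GiB -- far beyond typical RAM
```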
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:

- k-means clustering in R on a very large, sparse matrix? (bigkmeans)
- Clustering big data in R, and is sampling relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
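As a sketch of that idea (not part of the original answer), cluster::clara can be run directly on the individuals' coordinates; the random matrix below stands in for mca.res$ind$coord, and the choice of k = 5 clusters is purely illustrative:

```r
library(cluster)  # provides clara(), one of R's recommended packages

# Stand-in for mca.res$ind$coord: 10000 individuals on 5 MCA dimensions
set.seed(42)
coords <- matrix(rnorm(10000 * 5), ncol = 5)

# clara() clusters around medoids using repeated sub-sampling,
# so it scales to large n without a full distance matrix
res.clara <- clara(coords, k = 5, samples = 10)

# One cluster label per individual
head(res.clara$clustering)
```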
Another idea, suggested in response to the question clustering very large dataset in R, is to first use k-means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a real number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.
The following plot shows the clusters based on all 300 individuals (kk = Inf) and based on the 30 k-means centres (kk = 30), with the data plotted on the first two MCA axes:
It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1000 k-means centres, perhaps up to 6000 with 8 GB of RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
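For intuition, the two-stage idea behind kk can be sketched in base R alone (this is a simplified illustration, not HCPC's actual implementation; the sizes and cluster counts are arbitrary):

```r
set.seed(1)
# Stand-in for the MCA coordinates of the individuals (smaller here)
x <- matrix(rnorm(50000 * 5), ncol = 5)

# Stage 1: reduce 50000 rows to 100 k-means centres
km <- kmeans(x, centers = 100, iter.max = 50)

# Stage 2: hierarchical clustering on the centres only --
# the distance matrix is now 100 x 100 instead of 50000 x 50000
hc <- hclust(dist(km$centers), method = "ward.D2")

# Cut the tree into, say, 5 clusters, then map every original row
# to a final cluster through its k-means centre
centre.cluster <- cutree(hc, k = 5)
row.cluster <- centre.cluster[km$cluster]
length(row.cluster)  # one label per original row
```

The memory-heavy step (the distance matrix) now scales with the number of centres rather than the number of observations, which is why the approach works at 600000 rows.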