R: issue with hierarchical clustering after a multiple correspondence analysis


Question

I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components. Each of my vectors consists of one email and 30 qualitative variables. Each qualitative variable has 4 classes: 0, 1, 2 and 3.

First, I load the FactoMineR library and my data:

library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")

Then I set my variables as qualitative (the 'email' variable is excluded in the next step):

for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}

I remove the emails from my vectors:

mydata2 = mydata[2:31]

And I run an MCA on this new dataset:

mca.res <- MCA(mydata2)

I now want to cluster my dataset using the HCPC function:

res.hcpc <- HCPC(mca.res)

But I get the following error message:

Error: cannot allocate vector of size 1296.0 Gb

What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?

Recommended answer

Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~180 billion elements). You simply don't have the RAM to store this object, and even if you did, the computation would likely take hours if not days to complete.
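As a rough check of that figure, the size of the lower triangle can be computed directly. This is only a back-of-the-envelope sketch: it assumes each distance is stored as an 8-byte double and ignores any overhead, so it gives the order of magnitude rather than the exact number in the error message.

```r
## number of observations in the question
n <- 600000

## number of pairwise distances in the lower triangle
n_pairs <- n * (n - 1) / 2

## memory needed, assuming 8 bytes per distance (double precision)
bytes <- n_pairs * 8
gib <- bytes / 2^30

format(n_pairs, big.mark = ",")  # element count
round(gib)                       # size in GiB
```

The result is well over 1000 GiB, which is why the allocation fails on any ordinary machine.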

There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:

k-means clustering in R on very large, sparse matrix? (bigkmeans)

Cluster Big Data in R and Is Sampling Relevant? (clara)

If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
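As an illustration, here is a minimal sketch of the clara route using the cluster package. The matrix coords is a randomly generated stand-in for mca.res$ind$coord, and the number of clusters (k = 4) and the number of samples are arbitrary choices made for the example, not recommendations.

```r
library(cluster)

set.seed(1)
## stand-in for mca.res$ind$coord: 100000 individuals on 5 MCA axes
## (random values, used here only so the example runs on its own)
coords <- matrix(rnorm(100000 * 5), ncol = 5)

## clara clusters repeated subsamples with PAM and assigns every
## observation to the nearest medoid, so no full distance matrix is built
cl <- clara(coords, k = 4, samples = 20)

## cluster sizes
table(cl$clustering)
```

With real data you would replace coords with mca.res$ind$coord and choose k from your own knowledge of the problem.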

Another idea, suggested in response to the question "clustering very large dataset in R", is to first use k-means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.

For example, using the tea data set from FactoMineR:

library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)

The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a finite number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.

The following plot shows the clusters based on the 300 individuals (kk = Inf) and based on the 30 k-means centres (kk = 30) for the data plotted on the first two MCA axes:

It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1000 k-means centres, perhaps up to 6000 with 8GB RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
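The two-stage idea can also be sketched by hand. The version below uses only base R (kmeans, dist and hclust from the stats package) in place of bigkmeans, SpatialTools::dist1 and fastcluster::hclust, so it shows the structure of the approach rather than the efficient implementation. The coords matrix is a random stand-in for mca.res$ind$coord, and the 200 centres, the Ward linkage and the final cut into 4 clusters are assumptions chosen for illustration.

```r
set.seed(42)
## stand-in for mca.res$ind$coord (random values, illustration only)
coords <- matrix(rnorm(20000 * 5), ncol = 5)

## stage 1: reduce the observations to a manageable number of centres
km <- kmeans(coords, centers = 200, iter.max = 100)

## stage 2: hierarchical clustering on the centres only,
## weighting each centre by the number of observations it represents
tree <- hclust(dist(km$centers), method = "ward.D2", members = km$size)

## cut the tree, then map each observation to the cluster of its centre
centre_cluster <- cutree(tree, k = 4)
obs_cluster <- unname(centre_cluster[km$cluster])

table(obs_cluster)
```

The distance matrix here has only 200 x 200 entries, so memory use is driven by the k-means step rather than by the hierarchical clustering.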
