Cutting dendrogram into n trees with minimum cluster size in R


Question


I'm trying to use hierarchical clustering (specifically hclust) to cluster a data set into 10 groups with sizes of 100 members or fewer, and with no group having more than 40% of the total population. The only method I currently know is to repeatedly use cut() and select continually lower levels of h until I'm happy with the dispersion of the cuts. However, this forces me to then go back and re-cluster the groups I pruned to aggregate them into 100-member groups, which can be very time consuming.


I've experimented with the dynamicTreeCut package, but can't figure out how to enter these (relatively simple) limitations. I'm using deepSplit as the way to designate the number of groupings, but following the documentation, this limits the maximum number to 4. For the exercise below, all I'm looking to do is to get the clusters into 5 groups of 3 or more individuals (I can deal with the maximum size limitation on my own, but if you want to try to tackle this too, it would be helpful!).


Here's my example, using the Orange dataset.

library(dynamicTreeCut)
library(reshape2)

##creating 14 individuals from Orange's original 5
Orange1<-Orange
Orange1$Tree<-as.numeric(as.character(Orange1$Tree))
Orange2<-Orange1
Orange3<-Orange1
Orange2$Tree=Orange2$Tree+6
Orange3$Tree=Orange3$Tree+11
combOr<-rbind(Orange1, Orange2[1:28,], Orange3)


####casting the data to make a correlation matrix, and then running 
#### a hierarchical cluster
castOrange<-dcast(combOr, age~Tree, mean, fill=0)
castOrange[,16]<-c(1,34,5,35,34,35,21)
castOrange[,17]<-c(1,34,5,35,34,35,21)
orangeCorr<-cor(castOrange[, -1])
orangeClust<-hclust(dist(orangeCorr))

###running the dynamic tree cut
dynamicCut<-cutreeDynamic(orangeClust, minClusterSize=3, method="tree", deepSplit=4)

dynamicCut
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0


As you can see, it only designates two clusters. For my exercise, I want to shy away from using an explicit height term to cut the trees, as I want k trees instead.
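As an aside on the "k trees instead of a height" point: base R's cutree() already accepts k directly. A minimal sketch (no extra packages, using the built-in USArrests data rather than the Orange example above) showing that cutree() gives exactly k groups but no control over minimum or maximum cluster size, which is precisely the gap this question is about:

```r
# cutree(hc, k = ...) cuts an hclust tree into exactly k groups,
# but the resulting group sizes are whatever the tree yields --
# there is no minimum- or maximum-size constraint.
hc <- hclust(dist(USArrests))   # 50 observations, complete linkage
groups <- cutree(hc, k = 5)     # request exactly k = 5 clusters
table(groups)                   # inspect the (uncontrolled) sizes
```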

Answer


1- Figure out the most appropriate dissimilarity measure (e.g., "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski") and linkage method (e.g., "ward", "single", "complete", "average", "mcquitty", "median", or "centroid") based on the nature of your data and the objective(s) of clustering. See ?dist and ?hclust for more details.


2- Plot the dendrogram before starting the cutting step. See ?hclust for more details.


3- Use the hybrid adaptive tree cut method in the dynamicTreeCut package, and tune the shape parameters (maxCoreScatter and minGap / maxAbsCoreScatter and minAbsGap). See Langfelder et al. 2009 (http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/BranchCutting/Supplement.pdf).

For example,


1- Change "euclidean" and/or "complete" methods as appropriate,

orangeClust <- hclust(dist(orangeCorr, method="euclidean"), method="complete")

2- Plot the dendrogram,

plot(orangeClust)


3- Use the hybrid tree cut method and tune shape parameters,

dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=NULL, minGap=NULL, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
 ..cutHeight not given, setting it to 1.8  ===>  99% of the (truncated) height range in dendro.
 ..done.
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0


As a guide for tuning the shape parameters, the default values are

deepSplit=0: maxCoreScatter = 0.64 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=1: maxCoreScatter = 0.73 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=2: maxCoreScatter = 0.82 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=3: maxCoreScatter = 0.91 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=4: maxCoreScatter = 0.95 & minGap = (1 - maxCoreScatter) * 3/4
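The defaults above all follow the same one-line formula, so they can be reproduced directly; a quick sketch:

```r
# Reproduce the default-value table above: at every deepSplit level,
# minGap is derived from maxCoreScatter by the same formula.
maxCoreScatter <- c(0.64, 0.73, 0.82, 0.91, 0.95)  # deepSplit = 0..4
minGap <- (1 - maxCoreScatter) * 3/4
round(minGap, 4)   # 0.2700 0.2025 0.1350 0.0675 0.0375
```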


As you can see, both maxCoreScatter and minGap should be between 0 and 1, and increasing maxCoreScatter (decreasing minGap) increases the number of clusters (with smaller sizes). The meaning of these parameters is described in Langfelder et al. 2009.


For example, to get more (and smaller) clusters:

maxCoreScatter <- 0.99
minGap <- (1 - maxCoreScatter) * 3/4
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=maxCoreScatter, minGap=minGap, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
 ..cutHeight not given, setting it to 1.8  ===>  99% of the (truncated) height range in dendro.
 ..done.
 2 3 2 2 2 3 3 2 2 3 3 2 2 2 1 2 1 1 1 2 2 1 1 2 2 1 1 1 0 0



Finally, your clustering constraints (size, height, number, etc.) should be reasonable and interpretable, and the generated clusters should agree with the data. This guides you to the important step of cluster validation and interpretation.
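On the maximum-size constraint the questioner said they could handle themselves: one simple base-R approach is to keep increasing k until no cluster exceeds the cap. The helper below (cut_with_max_size is a hypothetical name, not part of any package) is a sketch of that idea, again using the built-in USArrests data:

```r
# Hypothetical helper: re-cut with a larger k until every cluster is
# at or below max_size. Terminates because k = n gives size-1 clusters.
cut_with_max_size <- function(hc, k, max_size) {
  repeat {
    groups <- cutree(hc, k = k)
    if (max(table(groups)) <= max_size) return(groups)
    k <- k + 1   # more clusters -> generally smaller clusters
  }
}
groups <- cut_with_max_size(hclust(dist(USArrests)), k = 5, max_size = 15)
table(groups)    # no cluster larger than 15
```

Note this only enforces the maximum; clusters smaller than a desired minimum would still need to be merged back afterwards, as described in the question.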

Good luck!
