Cutting dendrogram into n trees with minimum cluster size in R
Question
I'm trying to use hierarchical clustering (specifically hclust) to cluster a data set into 10 groups of 100 members or fewer, with no group holding more than 40% of the total population. The only method I currently know is to call cut() repeatedly, choosing progressively lower values of h until I'm happy with the dispersion of the cuts. However, this forces me to go back and re-cluster the groups I pruned to aggregate them into 100-member groups, which can be very time consuming.
I've experimented with the dynamicTreeCut package, but can't figure out how to enter these (relatively simple) limitations. I'm using deepSplit to designate the number of groupings, but according to the documentation, this limits the maximum number to 4. For the exercise below, all I'm looking to do is get the clusters into 5 groups of 3 or more individuals (I can deal with the maximum size limitation on my own, but if you want to try to tackle this too, it would be helpful!).
Here's my example, using the Orange dataset.
library(dynamicTreeCut)
library(reshape2)
##creating 14 individuals from Orange's original 5
Orange1<-Orange
Orange1$Tree<-as.numeric(as.character(Orange1$Tree))
Orange2<-Orange1
Orange3<-Orange1
Orange2$Tree=Orange2$Tree+6
Orange3$Tree=Orange3$Tree+11
combOr<-rbind(Orange1, Orange2[1:28,], Orange3)
####casting the data to make a correlation matrix, and then running
#### a hierarchical cluster
castOrange<-dcast(combOr, age~Tree, mean, fill=0)
castOrange[,16]<-c(1,34,5,35,34,35,21)
castOrange[,17]<-c(1,34,5,35,34,35,21)
orangeCorr<-cor(castOrange[, -1])
orangeClust<-hclust(dist(orangeCorr))
###running the dynamic tree cut
dynamicCut<-cutreeDynamic(orangeClust, minClusterSize=3, method="tree", deepSplit=4)
dynamicCut
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
As you can see, it only designates two clusters. For my exercise, I want to shy away from using an explicit height term to cut the tree, as I want k trees instead.
Recommended answer
1- Figure out the most appropriate dissimilarity measure (e.g., "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski") and linkage method (e.g., "ward", "single", "complete", "average", "mcquitty", "median", or "centroid") based on the nature of your data and the objective(s) of clustering. See ?dist and ?hclust for more details.
2- Plot the dendrogram before starting the cutting step. See ?hclust for more details.
3- Use the hybrid adaptive tree cut method in the dynamicTreeCut package, and tune the shape parameters (maxCoreScatter and minGap / maxAbsCoreScatter and minAbsGap). See Langfelder et al. 2009 (http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/BranchCutting/Supplement.pdf).
For example,
1- Change the "euclidean" and/or "complete" methods as appropriate,
orangeClust <- hclust(dist(orangeCorr, method="euclidean"), method="complete")
2- Plot the dendrogram,
plot(orangeClust)
3- Use the hybrid tree cut method and tune shape parameters,
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=NULL, minGap=NULL, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
..cutHeight not given, setting it to 1.8 ===> 99% of the (truncated) height range in dendro.
..done.
2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
As a guide for tuning the shape parameters, the default values are
deepSplit=0: maxCoreScatter = 0.64 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=1: maxCoreScatter = 0.73 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=2: maxCoreScatter = 0.82 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=3: maxCoreScatter = 0.91 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=4: maxCoreScatter = 0.95 & minGap = (1 - maxCoreScatter) * 3/4
As you can see, both maxCoreScatter and minGap should be between 0 and 1, and increasing maxCoreScatter (or decreasing minGap) increases the number of clusters (with smaller sizes). The meaning of these parameters is described in Langfelder et al. 2009.
For example, to get more, smaller clusters,
maxCoreScatter <- 0.99
minGap <- (1 - maxCoreScatter) * 3/4
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=maxCoreScatter, minGap=minGap, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
..cutHeight not given, setting it to 1.8 ===> 99% of the (truncated) height range in dendro.
..done.
2 3 2 2 2 3 3 2 2 3 3 2 2 2 1 2 1 1 1 2 2 1 1 2 2 1 1 1 0 0
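If tuning the shape parameters does not give the number of clusters asked for in the question, a different route is to search over k directly with base R's cutree(). The sketch below is not the dynamicTreeCut method and not part of this answer's approach: `cut_min_size` is a hypothetical helper, and the toy data are invented so the result is easy to follow. It raises k until the requested number of clusters each reach the minimum size, and labels members of smaller clusters 0, mirroring the "unassigned" convention in the output above.

```r
# Hypothetical helper (not part of dynamicTreeCut): raise k in cutree() until
# n_clusters groups each hold at least min_size members. Members of smaller
# groups are labelled 0 (unassigned).
cut_min_size <- function(hc, n_clusters, min_size) {
  n_obs <- length(hc$order)
  for (k in seq_len(n_obs)) {
    labels <- cutree(hc, k = k)
    sizes <- table(labels)
    keep <- names(sizes)[sizes >= min_size]
    if (length(keep) >= n_clusters) {
      # Relabel the retained clusters 1..m; everything else becomes 0.
      return(match(as.character(labels), keep, nomatch = 0L))
    }
  }
  NULL  # no cut of this tree satisfies the constraints
}

# Invented toy data: five tight groups of four points plus two outliers.
x <- c(0:3, 100:103, 250:253, 450:453, 800:803, 1500, 3000)
hc <- hclust(dist(x), method = "complete")

groups <- cut_min_size(hc, n_clusters = 5, min_size = 3)
print(groups)
```

On real data there may be no k that satisfies the constraints, in which case the helper returns NULL; a maximum-size limit would need an extra check of the same kind.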
Finally, your clustering constraints (size, height, number, etc.) should be reasonable and interpretable, and the generated clusters should agree with the data. This leads to the important step of cluster validation and interpretation.
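As a first sanity check on the constraints, the label vector returned by a cut can be summarised with table(). The sketch below uses a hypothetical helper, `check_clusters`, and applies it to the labels printed in the second cutreeDynamic output above, with the minimum size of 3 and the 40% cap taken from the question; treating label 0 as unassigned is an assumption based on that output.

```r
# Hypothetical helper: summarise a label vector (0 = unassigned) against a
# minimum cluster size and a maximum share of all observations.
check_clusters <- function(labels, min_size, max_share) {
  sizes <- table(labels[labels != 0])
  largest_share <- max(sizes) / length(labels)
  list(
    n_clusters    = length(sizes),
    smallest      = as.integer(min(sizes)),
    largest_share = largest_share,
    ok            = all(sizes >= min_size) && largest_share <= max_share
  )
}

# Labels copied from the second cutreeDynamic output above.
labels <- c(2, 3, 2, 2, 2, 3, 3, 2, 2, 3, 3, 2, 2, 2, 1,
            2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 0, 0)

res <- check_clusters(labels, min_size = 3, max_share = 0.4)
str(res)
```

Here the largest cluster holds 14 of the 30 items, so the 40% cap from the question is not met even though every cluster has at least 3 members.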
Good luck!