How to identify sequences within each cluster?

Question

Using the biofam dataset that comes as part of TraMineR:

library(TraMineR)
data(biofam)
lab <- c("P","L","M","LM","C","LC","LMC","D")
biofam.seq <- seqdef(biofam[,10:25], states=lab)
head(biofam.seq)
     Sequence                                    
1167 P-P-P-P-P-P-P-P-P-LM-LMC-LMC-LMC-LMC-LMC-LMC
514  P-L-L-L-L-L-L-L-L-L-L-LM-LMC-LMC-LMC-LMC    
1013 P-P-P-P-P-P-P-L-L-L-L-L-LM-LMC-LMC-LMC      
275  P-P-P-P-P-L-L-L-L-L-L-L-L-L-L-L             
2580 P-P-P-P-P-L-L-L-L-L-L-L-L-LMC-LMC-LMC       
773  P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P 

I can perform a cluster analysis:

library(cluster)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- cutree(clusterward, k = 3)
cluster3 <- factor(cluster3, labels = c("Type 1", "Type 2", "Type 3"))

However, in this process the unique IDs from biofam.seq have been replaced by a list of numbers 1 through N:

head(cluster3, 10)
[1] Type 1 Type 2 Type 2 Type 2 Type 2 Type 3 Type 3 Type 2 Type 1
[10] Type 2
Levels: Type 1 Type 2 Type 3

Now I want to know which sequences are in each cluster, so that I can apply other functions to get the mean length, entropy, subsequences, dissimilarity, etc. within each cluster. What I need to do is:

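  1. Map the old IDs to the new IDs
  2. Put the sequences of each cluster into a separate sequence object
  3. Run the desired statistics on each new sequence object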

How can I accomplish 2 and 3 in the list above?

Recommended answer

The cluster membership vector cluster3 is in the same row order as biofam.seq, so the state sequence object for the first cluster, for example, can simply be obtained with:

bio1.seq <- biofam.seq[cluster3=="Type 1",]
summary(bio1.seq)
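The same indexing idea extends to all three steps in the question. What follows is only a sketch, not part of the original answer: it repeats the setup from the question so it runs on its own, and the names `membership` and `seq.by.cluster` are made up for illustration. It assumes the TraMineR and cluster packages are installed, and it picks `seqlength()`, `seqient()` and `dissvar()` as examples of per-cluster statistics; other TraMineR functions could be swapped in the same way.

```r
library(TraMineR)
library(cluster)

# Setup repeated from the question so the sketch is self-contained.
data(biofam)
lab <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
biofam.seq <- seqdef(biofam[, 10:25], states = lab)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- factor(cutree(clusterward, k = 3),
                   labels = c("Type 1", "Type 2", "Type 3"))

# 1. cluster3 follows the row order of biofam.seq, so the original IDs
#    are the row names of the sequence object.
#    (membership is a hypothetical name.)
membership <- data.frame(id = rownames(biofam.seq), cluster = cluster3,
                         stringsAsFactors = FALSE)
head(membership)

# 2. One state sequence object per cluster, stored in a named list.
#    (seq.by.cluster is a hypothetical name.)
seq.by.cluster <- lapply(levels(cluster3),
                         function(k) biofam.seq[cluster3 == k, ])
names(seq.by.cluster) <- levels(cluster3)

# 3. Example statistics computed separately for each cluster.
# Mean sequence length:
sapply(seq.by.cluster, function(s) mean(seqlength(s)))
# Mean within-sequence entropy:
sapply(seq.by.cluster, function(s) mean(seqient(s)))
# Discrepancy (pseudo-variance) of the within-cluster OM dissimilarities:
sapply(levels(cluster3), function(k) {
  idx <- cluster3 == k
  dissvar(biofam.om[idx, idx])
})
```

Each element of the list is an ordinary state sequence object, so for instance `summary(seq.by.cluster[["Type 2"]])` should behave just like `summary(bio1.seq)` above.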
