当序列包含缺口时,如何计算序列之间的差异? [英] How to compute dissimilarities between sequences when sequences contain gaps?

查看:239
本文介绍了当序列包含缺口时,如何计算序列之间的差异?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想从包含缺失的数据(即包含缺口的序列)中对与TraMineR::seqdist()最佳匹配的序列进行聚类.

I want to cluster sequences with optimal matching with TraMineR::seqdist() from data that contains missings, i.e. sequences containing gaps.

library(TraMineR)
data(ex1)
sum(is.na(ex1))

# [1] 38

sq <- seqdef(ex1[1:13])
sq

#    Sequence                 
# s1 *-*-*-A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B      
# s3 *-D-D-D-D-D-D-D-D-D-D    
# s4 A-A-*-*-B-B-B-B-D-D      
# s5 A-*-A-A-A-A-*-A-A-A      
# s6 *-*-*-C-C-C-C-C-C-C      
# s7 *-*-*-*-*-*-*-*-*-*-*-*-*

sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)

#      A-> B->   C-> D->
# A->   0 2.000   2 2.000
# B->   2 0.000   2 1.823
# C->   2 2.000   0 2.000
# D->   2 1.823   2 0.000

当我运行seqdist()

dist.om <- seqdist(sq, method="OM", indel=1, sm=sm)

我收到

Error: 'with.missing' must be TRUE when 'seqdata' or 'refseq' contains missing values

但是当我设置选项with.missing=TRUE时我会收到

but when I set option with.missing=TRUE I'm receiving

 [>] including missing values as an additional state
 [>] 7 sequences with 5 distinct states
 [>] checking 'sm' (one value for each state, triangle inequality)
Error:  [!] size of substitution cost matrix must be 5x5

那么,当数据包含缺失(即序列包含缺口)时,如何使用seqdist()seqsubm()的输出正确计算序列之间的差异?

So, how can we compute the dissimilarities between sequences using seqdist() with the output of seqsubm() the right way when the data contain missings i.e. sequences contain gaps?

注意:我不确定这是否完全有意义.到目前为止,我只是排除了缺失的观测值,但是由于我的数据,我因此失去了很多观测值.因此,值得知道是否有这样的选择.

Note: I'm not very sure if this makes sense at all. So far I just exclude observations with missings but due to my data I loose lots of observations by that. Therefore it would be worthwhile to know if there's such an option.

推荐答案

有差距时,可以使用不同的策略来计算距离.

There are different strategies to computing distances when you have gaps.

1)第一个解决方案是将丢失状态视为附加状态.设置with.missing=TRUE时,这就是seqdist的作用.在那种情况下,sm矩阵应包含用缺失状态替换任何状态的成本.使用seqsubm,您也只需要为该功能指定with.missing=TRUE.默认情况下,用于替换缺失"的替换成本设置为固定值miss.cost(默认为2).

1) A first solution is to consider the missing state as an additional state. This is what seqdist does when you set with.missing=TRUE. In that case the sm matrix should contain costs of substituting any state with the missing state. Using seqsubm you just need to specify with.missing=TRUE for that function too. By default, the substitution costs to substitute 'missing' are set as the fixed value miss.cost (2 by default).

sm <- seqsubm(sq, method='TRATE', with.missing=TRUE)
round(sm,digits=3)

#     A->   B-> C->   D-> *->
# A->   0 2.000   2 2.000   2
# B->   2 0.000   2 1.823   2
# C->   2 2.000   0 2.000   2
# D->   2 1.823   2 0.000   2
# *->   2 2.000   2 2.000   0

根据过渡概率获取缺失"的替代成本

To get substitution costs for 'missing' based on transition probabilities

sm <- seqsubm(sq, method='TRATE', with.missing=TRUE, miss.cost.fixed=FALSE)
round(sm,digits=3)

#       A->   B->   C->   D->   *->
# A-> 0.000 2.000 2.000 2.000 1.703
# B-> 2.000 0.000 2.000 1.823 1.957
# C-> 2.000 2.000 0.000 2.000 1.957
# D-> 2.000 1.823 2.000 0.000 1.957
# *-> 1.703 1.957 1.957 1.957 0.000

使用后一个sm,我们可以获得序列之间的距离

Using the latter sm, we get the distances between sequences

dist.om <- seqdist(sq, method="OM", indel=1, sm=sm, with.missing=TRUE)
round(dist.om, digits=2)

#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
# [1,]  0.00 22.87 21.91 18.41  6.41 17.00 17.03
# [2,] 22.87  0.00 13.76 11.56 19.91 19.87 22.57
# [3,] 21.91 13.76  0.00 14.25 18.96 18.91 21.57
# [4,] 18.41 11.56 14.25  0.00 13.70 15.70 18.14
# [5,]  6.41 19.91 18.96 13.70  0.00 15.70 16.62
# [6,] 17.00 19.87 18.91 15.70 15.70  0.00 16.70
# [7,] 17.03 22.57 21.57 18.14 16.62 16.70  0.00

当然,序列会因为彼此共享许多缺失状态(*)而彼此接近.因此,您可能只想保留那些少于10%的元素缺失的序列.

Of course, sequences would then be close from each other just because they share many missing states (*). Therefore, you may want to retain only sequences with say less than 10 % of their elements missing.

2)第二种解决方案是删除间隙,该间隙在seqdef中执行. (但是,请注意,这会更改对齐方式.)

2) A second solution is to delete the gaps, which you do in seqdef. (However, note that this changes alignment.)

## Here, we drop seq 7 that contains only missing values

sq <- seqdef(ex1[-7,1:13], left='DEL', gaps='DEL')
sq

#    Sequence           
# s1 A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B
# s3 D-D-D-D-D-D-D-D-D-D
# s4 A-A-B-B-B-B-D-D    
# s5 A-A-A-A-A-A-A-A    
# s6 C-C-C-C-C-C-C  

sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)

#       A->   B-> C->   D->
# A-> 0.000 1.944   2 2.000
# B-> 1.944 0.000   2 1.823
# C-> 2.000 2.000   0 2.000
# D-> 2.000 1.823   2 0.000

dist.om <- seqdist(sq, method="OM", indel=1, sm=sm)
round(dist.om, digits=2)

#       [,1]  [,2]  [,3]  [,4]  [,5] [,6]
# [1,]  0.00 19.61 20.00 13.78  2.00   17
# [2,] 19.61  0.00 12.76  9.59 17.61   17
# [3,] 20.00 12.76  0.00 13.29 18.00   17
# [4,] 13.78  9.59 13.29  0.00 11.78   15
# [5,]  2.00 17.61 18.00 11.78  0.00   15
# [6,] 17.00 17.00 17.00 15.00 15.00    0

这篇关于当序列包含缺口时,如何计算序列之间的差异?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆