Find specific patterns in sequences


Question

I'm using the R package TraMineR to do sequence analysis for academic research.

I want to find a pattern defined as someone being in the target company, then leaving, then coming back to the target company.

(Simplified) I've defined state A as the target company, B as an outside-industry company, and C as an inside-industry company.

So what I want to do is find the sequences that contain the specific patterns A-B-A or A-C-A.

After looking at this question (Strange number of subsequences?) and reading the user guide, especially the following passages:

4.3.3 Subsequences. A sequence u is a subsequence of x if all successive elements u_i of u appear in x in the same order, which we simply denote by u ⊂ x. According to this definition, unshared states can appear between those common to both sequences u and x. For example, u = S; M is a subsequence of x = S; U; M; MC.

7.3.2 Finding sequences with a given subsequence. The seqpm() function counts the number of sequences that contain a given subsequence and collects their row index numbers. The function returns a list with two elements. The first element, MTab, is just a table with the number of occurrences of the given subsequence in the data. Note that only one occurrence is counted per sequence, even when the subsequence appears more than one time in the sequence. The second element of the list, MIndex, gives the row index numbers of the sequences containing the subsequence. These index numbers may be useful for accessing the concerned sequences (example below). Since it is easier to search a pattern in a character string, the function first translates the sequence data in this format when using the seqconc function with the TRUE option.

I concluded that seqpm() was the function I needed to get the job done.

So I have sequences like: A-A-A-A-A-B-B-B-B-B-A-A-A-A-A

Based on the definition of subsequences that I found in the sources mentioned above, I figured I could find that kind of sequence by using:

seqpm(sequence,"ABA")

But that does not work. To find that example sequence, I need to input

seqpm(sequence,"ABBBBBA")

which is not very useful for what I need.

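  1. So, do you see anything I might have missed?
  2. How can I retrieve all the sequences that go from A to B and back to A?
  3. Is there a way to find the sequences that go from A to anything else and then back to A?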

Thanks a lot!

Recommended answer

The title of the seqpm help page is "Find substring patterns in sequences", and this is what the function actually does. It searches for sequences that contain a given substring (not a subsequence). It seems the user guide is worded incorrectly on this point.
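The difference between the two notions can be illustrated in base R, without TraMineR. This is only a sketch: `is_substring` and `is_subsequence` are hypothetical helper names (not TraMineR functions), and each sequence is assumed to be a plain string with one character per state.

```r
# Hypothetical helpers; each sequence is a string with one character per state.

# Substring: the pattern must appear as one contiguous run.
is_substring <- function(pattern, x) {
  grepl(pattern, x, fixed = TRUE)
}

# Subsequence: the pattern's states must appear in order, gaps allowed.
# Built as a regex by joining the states with ".*" (assumes the states are
# letters or digits, so no regex escaping is needed).
is_subsequence <- function(pattern, x) {
  states <- strsplit(pattern, "", fixed = TRUE)[[1]]
  grepl(paste(states, collapse = ".*"), x)
}

x <- "AAAAABBBBBAAAAA"
is_substring("ABA", x)       # FALSE: "ABA" never occurs contiguously
is_substring("ABBBBBA", x)   # TRUE
is_subsequence("ABA", x)     # TRUE: A ... B ... A appear in that order
```

This mirrors the behaviour described in the question: a substring search only succeeds when the full run of B states is spelled out.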

One way to find the sequences that contain given subsequences is to convert the state sequences into event sequences with seqecreate, and then use the seqefsub and seqeapplysub functions. I illustrate this using the actcal data that ships with TraMineR.

library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal[,13:24])

## displaying the first state sequences
head(actcal.seq)

## transforming into event sequences
actcal.seqe <- seqecreate(actcal.seq, tevent = "state", use.labels=FALSE)

## displaying the first event sequences
head(actcal.seqe)

## now searching for the subsequences
subs <- seqefsub(actcal.seqe, strsubseq=c("(A)-(D)","(D)-(B)"))
## and identifying the sequences that contain the subsequences
subs.pres <- seqeapplysub(subs, method="presence")
head(subs.pres)

## we can now, for example, count the sequences that contain (A)-(D)
sum(subs.pres[,1])
## or list the sequences that contain (A)-(D)
rownames(subs.pres)[subs.pres[,1]==1]
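For the patterns in the question (A-B-A, A-C-A, or A to anything else and back to A), a base-R alternative is to collapse each sequence to its distinct successive states and then match on the collapsed string. The sketch below assumes sequences stored as dash-separated strings with single-character state labels, like the example in the question; `collapse_runs` and `leaves_and_returns` are hypothetical helpers, not TraMineR functions.

```r
# Collapse runs of identical states: "A-A-B-B-A" -> "ABA".
# Assumes single-character state labels separated by "-".
collapse_runs <- function(x) {
  states <- strsplit(x, "-", fixed = TRUE)[[1]]
  paste(rle(states)$values, collapse = "")
}

# TRUE if the sequence is in `target`, then in one or more other states,
# then back in `target`.
leaves_and_returns <- function(x, target = "A") {
  pattern <- paste0(target, "[^", target, "]+", target)
  grepl(pattern, collapse_runs(x))
}

seqs <- c("A-A-A-A-A-B-B-B-B-B-A-A-A-A-A",
          "A-A-C-C-A-A",
          "A-A-B-B-C-C",
          "B-B-A-A-A-A")

dss <- vapply(seqs, collapse_runs, character(1), USE.NAMES = FALSE)
dss                                   # "ABA" "ACA" "ABC" "BA"

# Exactly one intermediate spell: A-B-A or A-C-A
grepl("A[BC]A", dss)                  # TRUE TRUE FALSE FALSE

# A, then any number of non-A spells, then A again
vapply(seqs, leaves_and_returns, logical(1), USE.NAMES = FALSE)
```

For state sequence objects created with seqdef, TraMineR's seqdss() extracts the distinct successive states, so the same matching idea could probably be applied to its output instead of the hand-rolled `collapse_runs`.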

Hope this helps.
