Strange number of subsequences?

Question

I have a sequence object created like this:

subsequences <- function(data){
  slmax <- max(data$time)
  sequences.seqe <- seqecreate(data)
  sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time", end="end", id="id", status="event", limit=slmax)
  sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL", gaps = "DEL")
  (sequences.sts)
}

data <- subsequences(data)

head(data)

Which gives the output:

    Sequence                                                                     
[1] discussed-subscribed-*-discussed-*-discussed-*-discussed-*-discussed-*-closed
[2] *-opened-*-reviewed-*-discussed-*-discussed-*-discussed-*-merged             
[3] *-discussed-*-discussed-*-discussed-*-discussed                              
[4] *-opened-*-discussed-merged-discussed                                        
[5] *-discussed-*-referenced-discussed-closed-discussed-referenced-discussed     
[6] *-referenced-*-referenced-*-referenced-assigned-*-closed

But when I calculate the number of subsequences, I get seemingly ridiculous answers:

seqsubsn(head(data))
 [!] found missing state in the sequence(s), adding missing state to the alphabet
    Subseq.
[1]    1036
[2]    1248
[3]      88
[4]      49
[5]     294
[6]     240

How could the number of subsequences be so much larger than the number of events in each sequence?

A 'dput()' of the dataset can be found here. The issue seems to be that the original data has timestamps in seconds. However, I've used the function below to replace the timestamps with simple sequential positions:

read_seqdata <- function(data, startdate, stopdate){
  data <- read.table(data, sep = ",", header = TRUE)
  data <- subset(data, select = c("pull_req_id", "action", "created_at"))
  colnames(data) <- c("id", "event", "time")
  data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') <= '",stopdate,"'"))
  data$end <- data$time
  data <- data[with(data, order(time)), ]
  data$time <- match( data$time , unique( data$time ) )
  data$end <- match( data$end , unique( data$end ) )
  slmax <- max(data$time)
  (data)
}
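As a small illustration of what the `match( x , unique( x ) )` step does, here is a sketch on made-up timestamps (the values are hypothetical, not taken from the dataset). On sorted input it turns each timestamp into a dense rank, so gaps measured in seconds collapse into consecutive positions:

```r
# Hypothetical Unix timestamps in seconds, already sorted as in read_seqdata()
time <- c(1357000000, 1357000000, 1357003600, 1357090000, 1357090000, 1357999999)

# match() returns, for each value, its index among the unique values;
# on sorted input this is a dense rank: ties share a rank, and there are no gaps
rank_time <- match(time, unique(time))
rank_time
# [1] 1 1 2 3 3 4
```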

This makes it possible to create appropriate measures for entropy, sequence length etc., but the number of subsequences is still problematic.

Recommended answer

The number of subsequences returned is not surprising at all. It is a matter of how 'subsequence' is defined, which should not be confused with 'substring'.

A sequence $x = (x_1, x_2, \ldots, x_k)$ is a subsequence of $y$ if its elements $x_i$ all appear in $y$ in the same order as in $x$, though not necessarily next to each other. For instance, A-B-A is a subsequence of C-A-D-B-C-D-A-D.
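The definition can be checked mechanically. The following is a minimal base R sketch; `is_subsequence` is a hypothetical helper written for this illustration, not a TraMineR function. It scans `y` from left to right and matches each element of `x` at the earliest position still available:

```r
# TRUE if x is a subsequence of y: every element of x appears in y,
# in the same order, but not necessarily contiguously.
is_subsequence <- function(x, y) {
  pos <- 0L                       # position in y of the last matched element
  for (el in x) {
    hits <- which(y == el)        # all positions of el in y
    hits <- hits[hits > pos]      # keep only those after the last match
    if (length(hits) == 0L) return(FALSE)
    pos <- hits[1L]               # greedy: take the earliest possible match
  }
  TRUE
}

y <- c("C", "A", "D", "B", "C", "D", "A", "D")
is_subsequence(c("A", "B", "A"), y)   # TRUE: positions 2, 4 and 7
is_subsequence(c("B", "A", "B"), y)   # FALSE: y contains only one B
```

A-B-A is not a substring of `y`, since its elements are never adjacent there, yet it is a subsequence.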

To illustrate, consider the 'mvad' example from the TraMineR package.

library(TraMineR)
data(mvad)
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
mvad.seq <- seqdef(mvad, 17:86, states = mvad.scodes)
print(mvad.seq[1:3,], format="SPS")

##    Sequence                      
##[1] (EM,4)-(TR,2)-(EM,64)         
##[2] (FE,36)-(HE,34)               
##[3] (TR,24)-(FE,34)-(EM,10)-(JL,2)

seqsubsn(mvad.seq)[1:3]

##[1]  7  4 16

By default, seqsubsn computes the number of subsequences of the distinct successive states (DSS). The DSS of the first sequence, for example, is EM-TR-EM. The seven subsequences of EM-TR-EM are:

  • the empty sequence
  • two sequences made of a single element: EM and TR
  • three subsequences of length two: EM-TR, EM-EM, TR-EM
  • one sequence of length three: EM-TR-EM
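The counts above can be reproduced without TraMineR. The sketch below uses the standard dynamic-programming recurrence for counting distinct subsequences, with the empty sequence included. `count_subsequences` is a hypothetical helper, and this is not necessarily how seqsubsn is implemented internally:

```r
# Number of distinct subsequences of s, including the empty sequence.
# Appending a state doubles the count, minus the subsequences that were
# already produced the last time the same state was appended.
count_subsequences <- function(s) {
  n <- length(s)
  dp <- numeric(n + 1)   # dp[i + 1] = distinct subsequences of s[1:i]
  dp[1] <- 1             # only the empty sequence
  last <- list()         # last[[state]] = most recent position of that state
  for (i in seq_len(n)) {
    dp[i + 1] <- 2 * dp[i]
    prev <- last[[s[i]]]
    if (!is.null(prev)) dp[i + 1] <- dp[i + 1] - dp[prev]
    last[[s[i]]] <- i
  }
  dp[n + 1]
}

count_subsequences(c("EM", "TR", "EM"))        # 7
count_subsequences(c("FE", "HE"))              # 4
count_subsequences(c("TR", "FE", "EM", "JL"))  # 16
```

Applied to the DSS of the first three mvad sequences, this gives 7, 4 and 16, matching the seqsubsn output shown earlier.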

Proceeding the same way, you can verify that your fourth sequence (which is equal to its DSS)

*-opened-*-discussed-merged-discussed

has 49 subsequences, of which ten have length two:

*-opened, *-*, *-discussed, *-merged, opened-*, opened-discussed, opened-merged, discussed-merged, discussed-discussed, merged-discussed
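For a sequence this short, the count can also be checked by brute force. The sketch below is plain base R with a hypothetical helper `subseqs_of_length`. It picks every combination of positions, keeps their order, and drops duplicates. The missing state `*` is treated as an ordinary state, so `*-*` (positions 1 and 3) is counted among the two-length subsequences:

```r
s <- c("*", "opened", "*", "discussed", "merged", "discussed")

# All distinct subsequences of length k: choose k positions, keep their order,
# and drop the duplicates that arise from repeated states.
subseqs_of_length <- function(s, k) {
  if (k == 0) return("")   # the empty sequence
  unique(combn(length(s), k, function(idx) paste(s[idx], collapse = "-")))
}

by_length <- lapply(0:length(s), function(k) subseqs_of_length(s, k))
sapply(by_length, length)        # 1 4 10 14 13 6 1 (lengths 0 to 6)
sum(sapply(by_length, length))   # 49
by_length[[3]]                   # the two-length subsequences
```

The total of 49 agrees with the seqsubsn result for the fourth sequence. Enumeration grows exponentially with sequence length, so it is only practical as a sanity check on short sequences.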

Hope this helps.
