Select subsequences from a dataframe


Problem description

I have the following dataframe:

df <- structure(list(a = c(1, 43, 22, 12, 35, 113, 54, 94), b = c("a", 
"b", "c", "d", "e", "f", "g", "h")), .Names = c("a", "b"), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

From this data I want to select consecutive subsequences of a certain length. For example, for a sequence length of two, I want to select rows 1-2, 2-3, 3-4, and so on until the last row of the data frame. Each subsequence should then be labelled.

With a subsequence length of 2, the new df with its sequence labels would look like this:

a   b   seq_label
1   a   1 # First subsequence, row 1-2      
43  b   1 # 
43  b   2 # Second subsequence, row 2-3     
22  c   2 #         
22  c   3 # Third subsequence, row 3-4
12  d   3 #     
12  d   4
35  e   4       
35  e   5
113 f   5       
113 f   6
54  g   6       
54  g   7
94  h   7

Similarly, with a subsequence length of 3:

a   b  seq_label
1   a  1 # First subsequence, row 1-3
43  b  1 #          
22  c  1 #
43  b  2 # Second subsequence, row 2-4
22  c  2 #
12  d  2 #
22  c  3 # Third subsequence, row 3-5
12  d  3 #
35  e  3 #
12  d  4
35  e  4
113 f  4
35  e  5
113 f  5
54  g  5
113 f  6
54  g  6
94  h  6

....

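Thanks to @drjones's suggested answer, I have arrived at the following solution (n is the subsequence length):

library(purrr)  # map_dfr() is from purrr; its row-binding needs dplyr installed
map_dfr(1:(nrow(df) - n + 1), function(i) cbind(df[i:(i + n - 1), ], seq_label = i))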
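For readers without purrr, the same window-by-window idea can be sketched in base R. This is only an illustration under some assumptions: `label_windows` is a hypothetical helper name that does not appear in the question, and the data is rebuilt as a plain data.frame rather than a tibble so that no packages are required.

```r
# Same values as in the question, built as a plain data.frame
# (an assumption made here; the original df is a tibble).
df <- data.frame(a = c(1, 43, 22, 12, 35, 113, 54, 94),
                 b = letters[1:8],
                 stringsAsFactors = FALSE)

# label_windows() is a hypothetical helper, not part of the question.
label_windows <- function(df, n) {
  stopifnot(n >= 1, n <= nrow(df))
  starts <- seq_len(nrow(df) - n + 1)   # first row of each window
  pieces <- lapply(starts, function(i) {
    # rows i .. i + n - 1, tagged with the window number i
    cbind(df[i:(i + n - 1), , drop = FALSE], seq_label = i)
  })
  out <- do.call(rbind, pieces)         # stack the windows
  rownames(out) <- NULL                 # reset row names to 1, 2, ...
  out
}

res <- label_windows(df, 3)
head(res, 6)
```

Using `seq_len()` instead of `1:(nrow(df) - n + 1)` avoids a descending sequence if the upper bound ever reaches zero, although the `stopifnot()` guard already rules that case out here.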

Solution

We may create the indices using outer:

n <- 2
i <- 1:(nrow(df) - (n - 1))

cbind(df[t(outer(i, 1:n - 1, `+`)), ],
      seq_label = rep(i, each = n))
#      a b seq_label
# 1    1 a         1
# 2   43 b         1
# 3   43 b         2
# 4   22 c         2
# 5   22 c         3
# 6   12 d         3
# 7   12 d         4
# 8   35 e         4
# 9   35 e         5
# 10 113 f         5
# 11 113 f         6
# 12  54 g         6
# 13  54 g         7
# 14  94 h         7
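To see why the outer call yields the right row order, it may help to look at the intermediate index matrix on its own. The sketch below uses n = 3 and hard-codes 8 as the number of rows of the example data; the names `n_rows`, `offsets`, `idx` and `row_order` are introduced here for illustration only.

```r
n <- 3
n_rows <- 8                      # nrow(df) for the example data
i <- 1:(n_rows - (n - 1))        # window start rows: 1, 2, ..., 6

# `:` binds tighter than `-`, so 1:n - 1 means (1:n) - 1, i.e. 0, 1, ..., n - 1
offsets <- 1:n - 1

# Row k of idx holds the row numbers that make up window k
idx <- outer(i, offsets, `+`)
idx

# Transposing first makes as.vector() read the matrix window by window
row_order <- as.vector(t(idx))
row_order
# 1 2 3 2 3 4 3 4 5 4 5 6 5 6 7 6 7 8
```

Without the `t()`, the indices would be read column by column (all first rows, then all second rows), and the windows would no longer be contiguous in the result.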


...or kronecker:

cbind(df[kronecker(X = i, Y = 1:n - 1, FUN = `+`), ],
      seq_label = rep(i, each = n))


...or embed:

i <- 1:nrow(df)
cbind(df[as.vector(t(embed(i, n)[ , n:1])), ],
      seq_label = rep(head(i, -(n - 1)), each = n))
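As a rough illustration of what the embed variant does, the sketch below prints the matrix returned by `embed()` before and after the column reversal. It uses n = 3 and the row numbers 1 to 8 of the example data; `m` and `windows` are names chosen here, not taken from the answer.

```r
i <- 1:8                 # row numbers of the example data
n <- 3

m <- embed(i, n)         # one row per window, latest value first
m[1, ]                   # 3 2 1

windows <- m[, n:1]      # reverse the columns to get ascending row numbers
windows[1, ]             # 1 2 3
nrow(windows)            # 6 windows of length 3

# Labels for the windows: drop the last n - 1 row numbers
labels <- head(i, -(n - 1))
labels                   # 1 2 3 4 5 6
```

One caveat worth keeping in mind: `head(i, -(n - 1))` returns an empty vector when n is 1, because `head(i, -0)` is the same as `head(i, 0)`, so this variant appears to assume n >= 2.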
