R - 按数据框中的组识别行元素序列 [英] R - Identify a sequence of row elements by groups in a dataframe

查看:11
本文介绍了R - 按数据框中的组识别行元素序列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

考虑以下示例数据帧:

> df
   id name time
1   1    b   10
2   1    b   12
3   1    a    0
4   2    a    5
5   2    b   11
6   2    a    9
7   2    b    7
8   1    a   15
9   2    b    1
10  1    a    3

df = structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L), 
    name = c("b", "b", "a", "a", "b", "a", "b", "a", "b", "a"
    ), time = c(10L, 12L, 0L, 5L, 11L, 9L, 7L, 15L, 1L, 3L)), .Names = c("id", 
"name", "time"), row.names = c(NA, -10L), class = "data.frame")

我需要识别并记录所有序列 seq <- c("a","b"),其中基于时间"列,a"在b"之前,对于每个ID.a"和b"之间不允许有其他名称.实际序列长度至少为 5.样本数据的预期结果是

I need to identify and record all sequences seq <- c("a","b"), where "a" precedes "b" based on "time" column, for each id. No other names between "a" and "b" are permitted. Real sequence length is at least 5. The expected result for the sample data is

  a  b
1 3 10
2 5  7
3 9 11

有一个类似的问题 在 R 数据框中查找列值遵循序列的行.但是,在我的情况下,我不清楚如何处理id"列.有没有办法使用dplyr"解决问题?

There is a similar question Finding rows in R dataframe where a column value follows a sequence. However, it is not clear to me how to deal with "id" column in my case. Is it a way to solve the problem using "dplyr"?

推荐答案

library(dplyr); library(tidyr)

# sort data frame by id and time
df %>% arrange(id, time) %>% group_by(id) %>% 

       # get logical vector indicating rows of a followed by b and mark each pair as unique
       # by cumsum
       mutate(ab = name == "a" & lead(name) == "b", g = cumsum(ab)) %>% 

       # subset rows where conditions are met
       filter(ab | lag(ab)) %>% 

       # reshape your data frame to wide format
       select(-ab) %>% spread(name, time)


#Source: local data frame [3 x 4]
#Groups: id [2]

#     id     g     a     b
#* <int> <int> <int> <int>
#1     1     1     3    10
#2     2     1     5     7
#3     2     2     9    11

<小时>

如果序列的长度大于 2,则需要检查多个滞后,其中一种选择是使用 shift 函数(接受向量作为滞后/超前步长)来自data.table 结合Reduce,如果我们需要检查模式abb:


If length of the sequence is larger than two, then you will need to check multiple lags, and one option of this is to use shift function(which accepts a vector as lag/lead steps) from data.table combined with Reduce, say if we need to check pattern abb:

library(dplyr); library(tidyr); library(data.table)
pattern = c("a", "b", "b")
len_pattern = length(pattern)

df %>% arrange(id, time) %>% group_by(id) %>% 

       # same logic as before but use Reduce function to check multiple lags condition
       mutate(ab = Reduce("&", Map("==", shift(name, n = 0:(len_pattern - 1), type = "lead"), pattern)), 
              g = cumsum(ab)) %>% 

       # use reduce or to subset sequence rows having the same length as the pattern
       filter(Reduce("|", shift(ab, n = 0:(len_pattern - 1), type = "lag"))) %>% 

       # make unique names
       group_by(g, add = TRUE) %>% mutate(name = paste(name, 1:n(), sep = "_")) %>% 

       # pivoting the table to wide format
       select(-ab) %>% spread(name, time) 

#Source: local data frame [1 x 5]
#Groups: id, g [1]

#     id     g   a_1   b_2   b_3
#* <int> <int> <int> <int> <int>
#1     1     1     3    10    12

这篇关于R - 按数据框中的组识别行元素序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆