Consecutive occurrence in a data frame


Problem description

I have a data frame (shown below) containing different measurements. I would like to identify runs of consecutive w measurements (with a run length of 6 or more) across the time points t. For example, for id 1 there are 6 consecutive w measurements recorded from t3 to t8.

I would like to save the results into 2 data frames:

df1: At least 6 consecutive measurements of w (per id) before the first occurrence of w;
df2: From timing of the last occurrence of w (per id) there are less than 6 consecutive measurements of w;
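
The core requirement, "a run of at least 6 consecutive w values within one id", can be checked with base R's rle. The following is only a minimal sketch: the helper name has_run and its arguments are hypothetical, and the two vectors are the rows for ids 1 and 2 copied from the sample data.

```r
# Hypothetical helper: TRUE if x contains a run of at least `n`
# consecutive occurrences of `value`.
has_run <- function(x, value = "w", n = 6) {
  r <- rle(x)  # run-length encoding: r$values and r$lengths
  any(r$values == value & r$lengths >= n)
}

# Rows for id 1 and id 2 from the sample data (t1..t10).
id1 <- c("s", "s", "w", "w", "w", "w", "w", "w", "e", "w")
id2 <- c("s", "w", "w", "w", "e", "w", "w", "w", "w", "w")

res <- c(id1 = has_run(id1), id2 = has_run(id2))
print(res)
```

Here id 1 qualifies (6 w values from t3 to t8), while id 2 does not, because its longest w run has length 5.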
    

The format of my dataset with and without consecutive w occurrences:

 id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
  1  s  s  w  w  w  w  w  w  w  w #7 occ. of w after t3
  2  s  w  w  w  e  w  w  w  w  w #no 6 consecutive w occurrences
  3  w  w  w  w  w  w  s  s  s  r #6 occ. of w before t6
  4  e  w  w  w  w  w  w  w  w  w #9 occ. of w after t1
  5  w  w  w  w  w  w  r  w  w  w #6 occ. of w before t7
  6  w  s  w  r  w  r  w  w  s  w #no 6 consecutive w occurrences

Output:

Before w:

id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
3                  w  s  s  s  r
5                  w  r  
   
After w:

id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
1   s  s  w
4   e  w

Sample data:

df <- structure(
  list(
    id  = c(1, 2, 3, 4, 5, 6),
    t1  = c("s", "s", "w", "e", "w", "w"),
    t2  = c("s", "w", "w", "w", "w", "s"),
    t3  = c("w", "w", "w", "w", "w", "w"),
    t4  = c("w", "w", "w", "w", "w", "r"),
    t5  = c("w", "e", "w", "w", "w", "w"),
    t6  = c("w", "w", "w", "w", "w", "r"),
    t7  = c("w", "w", "s", "w", "r", "w"),
    t8  = c("w", "w", "s", "w", "w", "w"),
    t9  = c("e", "w", "s", "w", "w", "s"),
    t10 = c("w", "w", "r", "w", "w", "w")
  ),
  row.names = c(NA, 6L),
  class = "data.frame"
)
    

Code:

Before (does not enforce the requirement of at least 6 consecutive time steps):

df1 <- df
df1[-1] <- t(apply(df[-1], 1, function(x) replace(x, seq_along(x) > match('w', x), '')))
df1 <- df1[rowSums(df1 == 'w') != 0, , drop = FALSE]

After (does not enforce the requirement of at least 6 consecutive time steps):

df2 <- df
df2[-1] <- t(apply(df[-1], 1, function(x) replace(x, seq_along(x) <= match('w', x), '')))

df2 <- df2[c(TRUE, colSums(df2[-2] != '') > 0)]
df2 <- df2[rowSums(df2 == 'w') != 0, , drop = FALSE]
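
One likely reason these attempts ignore the run length is that match('w', x) returns the position of the first w, however short its run is. As a rough sketch (the helper first_run_start is hypothetical, not part of the original code), the start of the first qualifying run could be located with rle instead:

```r
# Hypothetical helper: index of the first element of the first run of
# at least `n` consecutive `value`s, or NA if there is no such run.
first_run_start <- function(x, value = "w", n = 6) {
  r <- rle(x)
  starts <- cumsum(c(1, head(r$lengths, -1)))  # start index of every run
  hit <- which(r$values == value & r$lengths >= n)
  if (length(hit) == 0) NA_integer_ else as.integer(starts[hit[1]])
}

# Rows for id 2 and id 4 from the sample data (t1..t10).
id2 <- c("s", "w", "w", "w", "e", "w", "w", "w", "w", "w")
id4 <- c("e", "w", "w", "w", "w", "w", "w", "w", "w", "w")

m2 <- match("w", id2)       # position of the first w, run length ignored
s2 <- first_run_start(id2)  # no run of 6 or more in id 2
s4 <- first_run_start(id4)  # the run of 9 w values in id 4 starts at t2
```

For id 2, match() still reports a position even though no run reaches 6, whereas the run-aware helper returns NA.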

Solution

Not very elegant and rather experimental, but you could try:

library(tidyverse)

df <- pivot_longer(df, -id) %>%
  # idx is a run id: consecutive identical values share the same index
  group_by(id, idx = rep(seq_along(rle(value)$lengths), times = rle(value)$lengths)) %>%
  # keep non-w values, and w values only if their run reaches a length of 6
  filter(any(cumsum(value == 'w') == 6 & value == 'w') | value != 'w') %>%
  group_by(id) %>% select(-idx) %>%
  # drop ids that have no qualifying w run left
  filter(any(value == 'w')) %>%
  mutate(w_consec = cumsum(value == 'w'),
         group = case_when(
           any(value != 'w' & w_consec == 0) ~ 'After',
           any(value != 'w' & w_consec == 6) ~ 'Before')) %>%
  filter(
    if (any(group == 'After')) (value == 'w' & w_consec == 1) | (value != 'w' & w_consec == 0)
    else w_consec == 6
    ) %>%
  pivot_wider(id_cols = c('id', 'group'), names_from = name, values_from = value)

By grouping on the idx variable in the second step, we make sure that we only keep occurrences of w that belong to a run of at least 6 consecutive repeats. Otherwise, with an example sequence such as wwwwwwebww, we would lose the e and b information: all w values would be carried into the next steps, and the result would end with a single w. The rle function is used here to assign the same index to all consecutive occurrences of a value (used this way, it behaves like data.table::rleid; see the help page of the latter for more context).
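
To make the run-id idea concrete, here is a small base R sketch on the example sequence wwwwwwebww. The variable names run_id, run_len and keep are illustrative only, and the final filter only roughly mirrors what the pipeline's second filter step does within one id.

```r
# Split the example sequence into single characters.
x <- strsplit("wwwwwwebww", "")[[1]]

r <- rle(x)
# Same index for every element of one run (like data.table::rleid(x)).
run_id  <- rep(seq_along(r$lengths), times = r$lengths)
# Length of the run each element belongs to.
run_len <- rep(r$lengths, times = r$lengths)

# Keep every non-w value, and w values only from runs of length >= 6.
keep <- x != "w" | run_len >= 6
kept <- paste(x[keep], collapse = "")

print(run_id)
print(kept)
```

The trailing ww run (length 2) is dropped, while the initial run of six w values and the e and b values survive.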

After that, you can use split:

split(df, df$group)

Output:

$After
# A tibble: 2 x 10
# Groups:   id [2]
     id group t1    t2    t3    t6    t7    t8    t9    t10  
  <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1     1 After s     s     w     NA    NA    NA    NA    NA   
2     4 After e     w     NA    NA    NA    NA    NA    NA   

$Before
# A tibble: 2 x 10
# Groups:   id [2]
     id group  t1    t2    t3    t6    t7    t8    t9    t10  
  <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1     3 Before NA    NA    NA    w     s     s     s     r    
2     5 Before NA    NA    NA    w     r     NA    NA    NA   

If you want to include it within your environment as separate data frames:

list2env(
  split(df, df$group), .GlobalEnv
)
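
As a variation, and only as a sketch, the split pieces can be placed in a dedicated environment instead of .GlobalEnv, which avoids silently overwriting existing objects named After or Before. The small data frame res below is a made-up stand-in for the pipeline result.

```r
# Made-up stand-in for the result of the pipeline (ids and groups only).
res <- data.frame(
  id    = c(1, 4, 3, 5),
  group = c("After", "After", "Before", "Before")
)

parts <- split(res, res$group)  # named list: $After and $Before

env <- new.env()
list2env(parts, envir = env)    # one object per list element

print(ls(env))
print(get("Before", envir = env))
```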
