Using dplyr lead, but with some constraints


Problem description

I have this data frame, dat, and I use dplyr to add two fields: "NextStartTime", which is the next start time after the end time for an ID, and "Duration", which is the time from the end time to that next start time for the same ID.

The data looks like this:

    library(dplyr)

    # Time zone is passed via tz instead of being embedded in each string
    dat <- data.frame(
      ID = c(1, 1, 1, 2, 3, 3),
      NumberInSequence = c(1, 3, 4, 1, 1, 2),
      StartTime = as.POSIXct(c("2016-01-01 05:52:05", "2016-01-01 05:52:11", "2016-01-01 05:52:16",
                               "2016-01-01 05:40:05", "2016-01-01 06:12:13", "2016-01-01 07:12:26"), tz = "GMT"),
      EndTime = as.POSIXct(c("2016-01-01 05:52:10", "2016-01-01 05:52:16", "2016-01-01 05:52:30",
                             "2016-01-01 05:46:05", "2016-01-01 06:12:25", "2016-01-01 08:00:00"), tz = "GMT")
    )

    dat
    dat %>% group_by(ID) %>%
      mutate(NextStartTime = lead(StartTime),
             duration = as.numeric(difftime(NextStartTime, EndTime, units = "secs")))

  ID NumberInSequence           StartTime             EndTime       NextStartTime duration
  <dbl>            <dbl>              <time>              <time>              <time>    <dbl>
1     1                1 2016-01-01 05:52:05 2016-01-01 05:52:10 2016-01-01 05:52:11        1
2     1                3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16        0
3     1                4 2016-01-01 05:52:16 2016-01-01 05:52:30                <NA>       NA
4     2                1 2016-01-01 05:40:05 2016-01-01 05:46:05                <NA>       NA
5     3                1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26     3601
6     3                2 2016-01-01 07:12:26 2016-01-01 08:00:00                <NA>       NA

That is very close to the right answer, but if a sequence number is missing within an ID, it still calculates a value, which is misleading.

For example, look at ID = 1: there are 3 entries with sequence numbers 1, 3 and 4. There is no #2 in the sequence. Because it is missing, the NextStartTime and Duration for ID = 1, NumberInSequence = 1 should be NA, not 05:52:11 and 1.

Is there a way to impose this logic?

Thank you.

Solution

Two options:


tidyr::complete

One option is to use tidyr::complete to fill in the missing rows, and use the previous method.

Downside: You get new mostly NA rows added. You could omit them after the fact with a careful filter call, though.
Upside: It's easy to write and understand, and preserves the original logic.

library(tidyverse)

dat %>% group_by(ID) %>% 
    complete(NumberInSequence = seq(max(NumberInSequence))) %>% 
    mutate(NextStartTime = lead(StartTime), 
           Duration = as.numeric(difftime(NextStartTime, EndTime, units = 's')))

## Source: local data frame [7 x 6]
## Groups: ID [3]
## 
##      ID NumberInSequence           StartTime             EndTime       NextStartTime Duration
##   <dbl>            <dbl>              <dttm>              <dttm>              <dttm>    <dbl>
## 1     1                1 2016-01-01 05:52:05 2016-01-01 05:52:10                <NA>       NA
## 2     1                2                <NA>                <NA> 2016-01-01 05:52:11       NA
## 3     1                3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16        0
## 4     1                4 2016-01-01 05:52:16 2016-01-01 05:52:30                <NA>       NA
## 5     2                1 2016-01-01 05:40:05 2016-01-01 05:46:05                <NA>       NA
## 6     3                1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26     3601
## 7     3                2 2016-01-01 07:12:26 2016-01-01 08:00:00                <NA>       NA
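
The "careful filter call" mentioned above could look something like the following. This is only a sketch, not part of the original answer: it assumes that a genuine row never has a missing StartTime, so any row with StartTime equal to NA must be a placeholder added by complete. The name `result` is just for illustration.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(
  ID = c(1, 1, 1, 2, 3, 3),
  NumberInSequence = c(1, 3, 4, 1, 1, 2),
  StartTime = as.POSIXct(c("2016-01-01 05:52:05", "2016-01-01 05:52:11", "2016-01-01 05:52:16",
                           "2016-01-01 05:40:05", "2016-01-01 06:12:13", "2016-01-01 07:12:26"), tz = "GMT"),
  EndTime = as.POSIXct(c("2016-01-01 05:52:10", "2016-01-01 05:52:16", "2016-01-01 05:52:30",
                         "2016-01-01 05:46:05", "2016-01-01 06:12:25", "2016-01-01 08:00:00"), tz = "GMT")
)

result <- dat %>%
  group_by(ID) %>%
  # Fill in the missing sequence numbers so that lead() sees the gaps
  complete(NumberInSequence = seq(max(NumberInSequence))) %>%
  mutate(NextStartTime = lead(StartTime),
         Duration = as.numeric(difftime(NextStartTime, EndTime, units = "secs"))) %>%
  # Assumption: real rows always have a StartTime, so NA marks a filled-in row
  filter(!is.na(StartTime)) %>%
  ungroup()

result
```

The filter has to come after the mutate; dropping the placeholder rows first would put the rows back next to each other and undo the effect of complete.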


subset lead(StartTime) with ifelse

ifelse inconveniently strips attributes, so you can't do ifelse(lead(NumberInSequence) == NumberInSequence + 1, lead(StartTime), NA) without converting the resulting plain number back to POSIXct, which is a hassle. Instead, it's easier to subset lead(StartTime) with the result of ifelse, passing NA where the sequence numbers don't match, so that indexing returns NA for those positions instead of dropping them.
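
To see the attribute stripping in isolation, here is a small base R sketch (the vector `x` and the other variable names are made up for illustration):

```r
x <- as.POSIXct(c("2016-01-01 05:52:11", "2016-01-01 05:52:16"), tz = "GMT")

# ifelse() returns a bare number: the POSIXct class is lost
stripped <- ifelse(c(TRUE, FALSE), x, NA)
class(stripped)   # "numeric"

# Indexing with a logical vector keeps the class; an NA index yields an NA element
kept <- x[c(TRUE, NA)]
class(kept)       # "POSIXct" "POSIXt"
is.na(kept)       # FALSE TRUE
```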

Downside: It's finicky to write in order to keep types.
Upside: No additional rows are added.

dat %>% group_by(ID) %>% 
    mutate(NextStartTime = lead(StartTime)[ifelse(lead(NumberInSequence) == (NumberInSequence + 1), TRUE, NA)], 
           duration = difftime(NextStartTime, EndTime, units = 's'))

## Source: local data frame [6 x 6]
## Groups: ID [3]
## 
##      ID NumberInSequence           StartTime             EndTime       NextStartTime  duration
##   <dbl>            <dbl>              <dttm>              <dttm>              <dttm>    <time>
## 1     1                1 2016-01-01 05:52:05 2016-01-01 05:52:10                <NA>   NA secs
## 2     1                3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16    0 secs
## 3     1                4 2016-01-01 05:52:16 2016-01-01 05:52:30                <NA>   NA secs
## 4     2                1 2016-01-01 05:40:05 2016-01-01 05:46:05                <NA>   NA secs
## 5     3                1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26 3601 secs
## 6     3                2 2016-01-01 07:12:26 2016-01-01 08:00:00                <NA>   NA secs
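
A possible third variant, not from the original answer, is dplyr's if_else, which is stricter about types than base ifelse and keeps the POSIXct class. The sketch below assumes the times are in GMT and passes a typed NA as the "false" value; rows where the condition itself is NA (the last row of each group) also come out as NA.

```r
library(dplyr)

dat <- data.frame(
  ID = c(1, 1, 1, 2, 3, 3),
  NumberInSequence = c(1, 3, 4, 1, 1, 2),
  StartTime = as.POSIXct(c("2016-01-01 05:52:05", "2016-01-01 05:52:11", "2016-01-01 05:52:16",
                           "2016-01-01 05:40:05", "2016-01-01 06:12:13", "2016-01-01 07:12:26"), tz = "GMT"),
  EndTime = as.POSIXct(c("2016-01-01 05:52:10", "2016-01-01 05:52:16", "2016-01-01 05:52:30",
                         "2016-01-01 05:46:05", "2016-01-01 06:12:25", "2016-01-01 08:00:00"), tz = "GMT")
)

result <- dat %>%
  group_by(ID) %>%
  mutate(
    # Keep the next start time only when the next row is the next sequence number
    NextStartTime = if_else(lead(NumberInSequence) == NumberInSequence + 1,
                            lead(StartTime),
                            as.POSIXct(NA, tz = "GMT")),
    duration = difftime(NextStartTime, EndTime, units = "secs")
  ) %>%
  ungroup()

result
```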
