Using dplyr lead, but with some constraints
I have this data frame, dat, and I use dplyr to add a "NextStartTime" field, which is the next start time after the end time for an ID, and a "duration" field, which is the time from the end time to that next start time.
The data looks like this:
library(dplyr)

dat = data.frame(ID = c(1,1,1,2,3,3),
                 NumberInSequence = c(1,3,4,1,1,2),
                 StartTime = as.POSIXct(c("2016-01-01 05:52:05","2016-01-01 05:52:11","2016-01-01 05:52:16","2016-01-01 05:40:05","2016-01-01 06:12:13","2016-01-01 07:12:26"), tz = "GMT"),
                 EndTime = as.POSIXct(c("2016-01-01 05:52:10","2016-01-01 05:52:16","2016-01-01 05:52:30","2016-01-01 05:46:05","2016-01-01 06:12:25","2016-01-01 08:00:00"), tz = "GMT")
)
dat

# Next start time within each ID, and the gap in seconds from EndTime to it
dat %>%
  group_by(ID) %>%
  mutate(NextStartTime = lead(StartTime),
         duration = as.numeric(difftime(NextStartTime, EndTime, units = "secs")))
     ID NumberInSequence           StartTime             EndTime       NextStartTime duration
  <dbl>            <dbl>              <time>              <time>              <time>    <dbl>
1     1                1 2016-01-01 05:52:05 2016-01-01 05:52:10 2016-01-01 05:52:11        1
2     1                3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16        0
3     1                4 2016-01-01 05:52:16 2016-01-01 05:52:30                <NA>       NA
4     2                1 2016-01-01 05:40:05 2016-01-01 05:46:05                <NA>       NA
5     3                1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26     3601
6     3                2 2016-01-01 07:12:26 2016-01-01 08:00:00                <NA>       NA
That is very close to the right answer, but if a sequence number is missing within an ID it still calculates a value, which is misleading.
For example, look at ID = 1: there are 3 entries, with sequence numbers 1, 3 and 4. There is no #2 in the sequence. Because it is missing, the NextStartTime and duration for ID = 1, NumberInSequence = 1 should be NA, not 05:52:11 and 1.
Is there a way to impose this logic?
Thank you.
Two options:
tidyr::complete
One option is to use tidyr::complete to fill in the missing rows, and then use the previous method.
Downside: You get new, mostly NA rows added. You could omit them after the fact with a careful filter call, though.
Upside: It's easy to write and understand, and preserves the original logic.
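library(tidyverse)

dat %>%
  group_by(ID) %>%
  # Add a row for every sequence number from 1 to the group's maximum
  complete(NumberInSequence = seq(max(NumberInSequence))) %>%
  # Filler rows have NA times, so lead() across a gap yields NA
  mutate(NextStartTime = lead(StartTime),
         Duration = as.numeric(difftime(NextStartTime, EndTime, units = 's')))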
## Source: local data frame [7 x 6]
## Groups: ID [3]
##
##      ID NumberInSequence           StartTime             EndTime       NextStartTime Duration
##   <dbl>            <dbl>              <dttm>              <dttm>              <dttm>    <dbl>
## 1     1                1 2016-01-01 05:52:05 2016-01-01 05:52:10                <NA>       NA
## 2     1                2                <NA>                <NA> 2016-01-01 05:52:11       NA
## 3     1                3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16        0
## 4     1                4 2016-01-01 05:52:16 2016-01-01 05:52:30                <NA>       NA
## 5     2                1 2016-01-01 05:40:05 2016-01-01 05:46:05                <NA>       NA
## 6     3                1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26     3601
## 7     3                2 2016-01-01 07:12:26 2016-01-01 08:00:00                <NA>       NA
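The filter call mentioned in the downside could look roughly like the sketch below. It rests on an assumption that is not stated in the question: every original row has a non-missing StartTime, so a row with an NA StartTime can only be a filler added by complete. If real rows can have a missing StartTime, a dedicated marker column would be safer. The data is rebuilt here (with tz = "GMT", also an assumption) so that the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Sample data from the question; tz = "GMT" is an assumption
dat <- data.frame(
  ID = c(1, 1, 1, 2, 3, 3),
  NumberInSequence = c(1, 3, 4, 1, 1, 2),
  StartTime = as.POSIXct(c("2016-01-01 05:52:05", "2016-01-01 05:52:11",
                           "2016-01-01 05:52:16", "2016-01-01 05:40:05",
                           "2016-01-01 06:12:13", "2016-01-01 07:12:26"), tz = "GMT"),
  EndTime = as.POSIXct(c("2016-01-01 05:52:10", "2016-01-01 05:52:16",
                         "2016-01-01 05:52:30", "2016-01-01 05:46:05",
                         "2016-01-01 06:12:25", "2016-01-01 08:00:00"), tz = "GMT")
)

res <- dat %>%
  group_by(ID) %>%
  # Fill in the missing sequence numbers so lead() sees the gaps
  complete(NumberInSequence = seq(max(NumberInSequence))) %>%
  mutate(NextStartTime = lead(StartTime),
         Duration = as.numeric(difftime(NextStartTime, EndTime, units = "secs"))) %>%
  # Assumption: only the filler rows have a missing StartTime
  filter(!is.na(StartTime)) %>%
  ungroup()

res
```

The filter has to come after the mutate: dropping the filler rows first would put the rows on either side of the gap next to each other again, and lead would return 05:52:11 for the first row as before.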
subset lead(StartTime) with ifelse
ifelse inconveniently strips attributes, so you can't do ifelse(lead(NumberInSequence) == NumberInSequence + 1, lead(StartTime), NA) without converting the resulting number back to POSIXct, which is a hassle. Instead, it's easier to subset with ifelse, passing an NA where it's not a match, so that indexing the vector returns NA instead of nothing (a FALSE index would drop the element and change the vector's length).
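A small sketch of the behaviour described here, using a hypothetical two-element vector x rather than the question's data: base ifelse returns a plain number, while logical subsetting with NA keeps the POSIXct class.

```r
# Hypothetical two-element example; tz = "GMT" is an assumption
x <- as.POSIXct(c("2016-01-01 05:52:11", "2016-01-01 05:52:16"), tz = "GMT")

# ifelse() drops the class: the result is seconds since the epoch
stripped <- ifelse(c(TRUE, FALSE), x, NA)
class(stripped)

# Getting the date-time back requires an explicit conversion
restored <- as.POSIXct(stripped, origin = "1970-01-01", tz = "GMT")
class(restored)

# Logical subsetting keeps the class; an NA index yields an NA element
kept <- x[c(TRUE, NA)]
class(kept)
length(kept)
```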
Downside: It's finicky to write in order to keep types.
Upside: No additional rows are added.
dat %>%
  group_by(ID) %>%
  # The index is TRUE where the next row continues the sequence, NA otherwise
  mutate(NextStartTime = lead(StartTime)[ifelse(lead(NumberInSequence) == (NumberInSequence + 1), TRUE, NA)],
         duration = difftime(NextStartTime, EndTime, units = 's'))
## Source: local data frame [6 x 6]
## Groups: ID [3]
##
##      ID NumberInSequence           StartTime             EndTime       NextStartTime  duration
##   <dbl>            <dbl>              <dttm>              <dttm>              <dttm>    <time>
## 1     1                1 2016-01-01 05:52:05 2016-01-01 05:52:10                <NA>   NA secs
## 2     1                3 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16    0 secs
## 3     1                4 2016-01-01 05:52:16 2016-01-01 05:52:30                <NA>   NA secs
## 4     2                1 2016-01-01 05:40:05 2016-01-01 05:46:05                <NA>   NA secs
## 5     3                1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26 3601 secs
## 6     3                2 2016-01-01 07:12:26 2016-01-01 08:00:00                <NA>   NA secs
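A possible variation, which is not part of the original answer: dplyr's own if_else is stricter about types than base ifelse and keeps the POSIXct class, so the condition can be written directly. The sketch below assumes a reasonably recent dplyr; the is_next helper column and tz = "GMT" are illustrative assumptions.

```r
library(dplyr)

# Sample data from the question; tz = "GMT" is an assumption
dat <- data.frame(
  ID = c(1, 1, 1, 2, 3, 3),
  NumberInSequence = c(1, 3, 4, 1, 1, 2),
  StartTime = as.POSIXct(c("2016-01-01 05:52:05", "2016-01-01 05:52:11",
                           "2016-01-01 05:52:16", "2016-01-01 05:40:05",
                           "2016-01-01 06:12:13", "2016-01-01 07:12:26"), tz = "GMT"),
  EndTime = as.POSIXct(c("2016-01-01 05:52:10", "2016-01-01 05:52:16",
                         "2016-01-01 05:52:30", "2016-01-01 05:46:05",
                         "2016-01-01 06:12:25", "2016-01-01 08:00:00"), tz = "GMT")
)

res <- dat %>%
  group_by(ID) %>%
  mutate(
    # Hypothetical helper: TRUE when the next row continues the sequence,
    # FALSE on a gap, NA on the last row of each ID
    is_next = lead(NumberInSequence) == NumberInSequence + 1,
    # if_else() keeps POSIXct; an NA condition gives an NA result
    NextStartTime = if_else(is_next, lead(StartTime), as.POSIXct(NA, tz = "GMT")),
    duration = as.numeric(difftime(NextStartTime, EndTime, units = "secs"))
  ) %>%
  ungroup() %>%
  select(-is_next)

res
```

Like the subsetting approach, this adds no extra rows; the trade-off is that the NA has to be given as a POSIXct value so that both branches have the same type.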