R: calculate time difference between specific events


Question

I have the following dataset:

df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
                  complete_order = c(rep(c(1,0,0,1), 2)),
                  order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')))  

library(lubridate)
df$order_date = as_date(df$order_date)

user_id complete_order order_date
      1              1 2015-01-28
      1              0 2015-01-31
      1              0 2015-02-08
      1              1 2015-02-23
      2              1 2015-01-25
      2              0 2015-01-28
      2              0 2015-02-06
      2              1 2015-02-21

I'm trying to calculate, for each user, the difference in days between each order and the last completed order. The desired outcome would look like this:

user_id complete_order order_date complete_order_time_diff
<fctr>         <fctr>     <date>              <time>
   1              1    2015-01-28             NA days
   1              0    2015-01-31              3 days
   1              0    2015-02-08             11 days
   1              1    2015-02-23             26 days
   2              1    2015-01-25             NA days
   2              0    2015-01-28              3 days
   2              0    2015-02-06             12 days
   2              1    2015-02-21             27 days

When I try this solution:

library(dplyr)

df %>% 
  group_by(user_id) %>%
  mutate(complete_order_time_diff = order_date[complete_order==1] - lag(order_date[complete_order==1]))

it returns an error:

Error: incompatible size (3), expecting 4 (the group size) or 1

Any help with this would be great, thank you!

Recommended answer

It seems that you're looking for the distance of each order from the last completed one. For a binary vector x, c(NA, cummax(x * seq_along(x))[-length(x)]) gives the index of the last "1" seen before each element. Subtracting the "order_date" at that index from each element of "order_date" then gives the desired output. E.g.

set.seed(1453); x = sample(0:1, 10, TRUE)
set.seed(1821); y = sample(5, 10, TRUE)
cbind(x, y, 
      last_x = c(NA, cummax(x * seq_along(x))[-length(x)]), 
      y_diff = y - y[c(NA, cummax(x * seq_along(x))[-length(x)])])
#      x y last_x y_diff
# [1,] 1 3     NA     NA
# [2,] 0 3      1      0
# [3,] 1 5      1      2
# [4,] 0 1      3     -4
# [5,] 0 3      3     -2
# [6,] 1 5      3      0
# [7,] 1 1      6     -4
# [8,] 0 3      7      2
# [9,] 0 4      7      3
#[10,] 1 5      7      4
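The same trick can be followed step by step on a small vector chosen by hand, so the result does not depend on the random number generator. This is only an illustrative sketch: the vectors x and y and the intermediate names pos and last_incl are made up for the walk-through. It also assumes the first element of x is 1, as in the question's data; otherwise cummax would yield an index of 0, which R drops when subsetting.

```r
# Hypothetical toy vectors (not from the question)
x <- c(1, 0, 0, 1, 0, 1)        # 1 marks a "completed" event
y <- c(10, 13, 21, 36, 40, 41)  # values to difference, e.g. day numbers

pos       <- x * seq_along(x)              # own position where x is 1, else 0
last_incl <- cummax(pos)                   # last 1 seen up to and including each element
last_x    <- c(NA, last_incl[-length(x)])  # shift by one: last 1 seen strictly before

y_diff <- y - y[last_x]

last_x
# [1] NA  1  1  1  4  4
y_diff
# [1] NA  3 11 26  4  5
```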

On your data, first format df for convenience:

df$order_date = as.Date(df$order_date)
df$complete_order = df$complete_order == "1"  # lose the 'factor'

Then, either apply the above approach after a group_by:

library(dplyr)
df %>% 
  group_by(user_id) %>% 
  mutate(time_diff = order_date - 
           order_date[c(NA, cummax(complete_order * seq_along(complete_order))[-length(complete_order)])])

Or, perhaps, try operations that avoid grouping (assuming the data is ordered by "user_id"), after accounting for the indices where "user_id" changes:

# save variables to vectors and keep a "logical" of when "id" changes
id = df$user_id
id_change = c(TRUE, id[-1] != id[-length(id)])

compl = df$complete_order
dord = df$order_date

# accounting for changes in "id", locate last completed order
i = c(NA, cummax((compl | id_change) * seq_along(compl))[-length(compl)])
is.na(i) = id_change

dord - dord[i]
#Time differences in days
#[1] NA  3 11 26 NA  3 12 27
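As a cross-check, the per-user version of the same idea can be sketched in base R alone. This is not part of the answer above, only one possible way to verify it: the data frame is rebuilt with explicit column types so no factor conversion is needed, last_completed is a hypothetical helper name, and the rows are assumed to be sorted by "user_id" so that the split results line up with the original row order.

```r
# Rebuild the question's data with explicit column types (no factors involved)
df <- data.frame(
  user_id        = rep(1:2, each = 4),
  complete_order = rep(c(TRUE, FALSE, FALSE, TRUE), 2),
  order_date     = as.Date(c("2015-01-28", "2015-01-31", "2015-02-08", "2015-02-23",
                             "2015-01-25", "2015-01-28", "2015-02-06", "2015-02-21"))
)

# Hypothetical helper: index of the last completed order before each row
last_completed <- function(done) {
  c(NA, cummax(done * seq_along(done))[-length(done)])
}

# Apply the helper within each user, then difference the dates (in days)
diff_by_user <- unlist(lapply(split(df, df$user_id), function(d) {
  as.numeric(d$order_date - d$order_date[last_completed(d$complete_order)])
}), use.names = FALSE)

# Assumes df is sorted by user_id, so the pieces come back in row order
df$complete_order_time_diff <- diff_by_user

df$complete_order_time_diff
# [1] NA  3 11 26 NA  3 12 27
```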
