R: calculate the number of occurrences of a specific event in the past AND future with groupings


Question

This question is a modification of a problem I posted earlier, where I have occurrences of a specific type on different days. This time the occurrences are assigned to multiple users, for example:

df = data.frame(user_id = c(rep(1:2, each=5)),
            cancelled_order = c(rep(c(0,1,1,0,0), 2)),
            order_date = as.Date(c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23',  '2015-03-23',
                                   '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21',  '2015-03-26')))


user_id cancelled_order order_date
      1               0 2015-01-28
      1               1 2015-01-31
      1               1 2015-02-08
      1               0 2015-02-23
      1               0 2015-03-23
      2               0 2015-01-25
      2               1 2015-01-28
      2               1 2015-02-06
      2               0 2015-02-21
      2               0 2015-03-26

I would like to calculate:

1) the number of cancelled orders that each customer is going to have in the next x days (e.g. 7, 14), excluding the current one, and

2) the number of cancelled orders that each customer had in the past x days (e.g. 7, 14), excluding the current one.

The desired output looks like this:

solution
user_id cancelled_order order_date plus14 minus14
      1               0 2015-01-28      2       0
      1               1 2015-01-31      1       0
      1               1 2015-02-08      0       1
      1               0 2015-02-23      0       0
      1               0 2015-03-23      0       0
      2               0 2015-01-25      2       0
      2               1 2015-01-28      1       0
      2               1 2015-02-06      0       1
      2               0 2015-02-21      0       0
      2               0 2015-03-26      0       0

A solution that fits this purpose well, apart from the grouping, was presented by @joel.wilson using data.table:

library(data.table)
vec <- c(14, 30) # Specify desired ranges
setDT(df)[, paste0("x", vec) := 
        lapply(vec, function(i) sum(df$cancelled_order[between(df$order_date, 
                                                 order_date, 
                                                 order_date + i, # this part can be changed to reflect the past date ranges
                                                 incbounds = FALSE)])),
        by = order_date]

However, it does not take grouping by user_id into account. When I tried to modify the code by adding this grouping as by = c("user_id", "order_date") or by = list(user_id, order_date), it did not work. It seems to be something very basic; any hints on how to get around this detail?

Also, keep in mind that I'm after a solution that works, even if it is not based on the above code or on data.table at all!

Thanks!

Recommended answer

Here's one way:

library(data.table)
orderDT = with(df, data.table(id = user_id, completed = !cancelled_order, d = order_date))

vec = list(minus = 14L, plus = 14L)
orderDT[, c("dplus", "dminus") := .(
    orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N
    ,
    orderDT[!(completed)][orderDT[, .(id, d_minus = d - vec$minus, d_yest = d - 1L)], on=.(id, d >= d_minus, d <= d_yest), .N, by=.EACHI]$N
)]


    id completed          d dplus dminus
 1:  1      TRUE 2015-01-28     2      0
 2:  1     FALSE 2015-01-31     1      0
 3:  1     FALSE 2015-02-08     0      1
 4:  1      TRUE 2015-02-23     0      0
 5:  1      TRUE 2015-03-23     0      0
 6:  2      TRUE 2015-01-25     2      0
 7:  2     FALSE 2015-01-28     1      0
 8:  2     FALSE 2015-02-06     0      1
 9:  2      TRUE 2015-02-21     0      0
10:  2      TRUE 2015-03-26     0      0

(I found OP's column names cumbersome and so shortened them.)
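Since the question also allows solutions that do not use data.table, the result can be cross-checked with a plain base R loop. This is only a minimal sketch, not part of the original answer: the helper name count_window and the column names plus14 and minus14 are chosen here for illustration, and it assumes that "excluding the current one" means excluding every order placed on the same day, as the join above does with d + 1L and d - 1L.

```r
# Rebuild the example data from the question
df <- data.frame(
  user_id = rep(1:2, each = 5),
  cancelled_order = rep(c(0, 1, 1, 0, 0), 2),
  order_date = as.Date(c("2015-01-28", "2015-01-31", "2015-02-08", "2015-02-23", "2015-03-23",
                         "2015-01-25", "2015-01-28", "2015-02-06", "2015-02-21", "2015-03-26"))
)

# Hypothetical helper: for row i, count cancelled orders of the same user
# whose date lies between `from` and `to` days away from row i's date
count_window <- function(i, from, to) {
  same_user <- df$user_id == df$user_id[i]
  gap <- as.numeric(df$order_date - df$order_date[i])  # difference in days
  sum(df$cancelled_order[same_user & gap >= from & gap <= to])
}

rows <- seq_len(nrow(df))
df$plus14  <- sapply(rows, count_window, from = 1,   to = 14)  # next 14 days
df$minus14 <- sapply(rows, count_window, from = -14, to = -1)  # past 14 days
df
```

This compares every row with every other row, so it is fine for a check on small data but will be much slower than the join on large tables.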

How it works

Each of the columns can be run on its own, like

orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N

And this can be broken down into steps by simplifying:

orderDT[!(completed)][
  orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], 
  on=.(id, d <= d_plus, d >= d_tom), 
  .N, 
  by=.EACHI]$N
# original version

orderDT[!(completed)][
  orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], 
  on=.(id, d <= d_plus, d >= d_tom), 
  .N, 
  by=.EACHI] 
# don't extract the N column of counts

orderDT[!(completed)][
  orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], 
  on=.(id, d <= d_plus, d >= d_tom)]
# don't create the N column of counts

orderDT[!(completed)]
# don't do the join

orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)]
# see the second table used in the join

This uses a "non-equi" join, which takes inequalities in on= to define the date ranges. For more details, see the documentation page found by typing ?data.table.
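To see the non-equi join in isolation, here is a minimal sketch on made-up data. The tables events and windows and their columns are hypothetical and only serve to show the pattern: for each row of the inner table, count the rows of the outer table that match on id and fall inside the range.

```r
library(data.table)

# Toy data (hypothetical): events to be counted, and the windows to count them in
events  <- data.table(id = c(1L, 1L, 2L), day = c(3L, 9L, 4L))
windows <- data.table(id = c(1L, 1L, 2L), lo = c(1L, 5L, 5L), hi = c(10L, 10L, 8L))

# For each row of `windows`, count the events with the same id and lo <= day <= hi.
# by = .EACHI evaluates .N once per row of `windows`; rows with no match get 0.
counts <- events[windows, on = .(id, day >= lo, day <= hi), .N, by = .EACHI]$N
counts
```

The first window contains both events of id 1, the second only the event on day 9, and the third none, so the counts line up one-to-one with the rows of windows. That alignment is what lets the answer assign the counts directly as new columns.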
