Calculate the number of occurrences of a specific event in the past AND future with groupings


Problem description

This question is a modification of a problem I posted previously. I have occurrences of a specific event type on different days, but this time they are assigned to multiple users, for example:

df = data.frame(user_id = c(rep(1:2, each=5)),
            cancelled_order = c(rep(c(0,1,1,0,0), 2)),
            order_date = as.Date(c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23',  '2015-03-23',
                                   '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21',  '2015-03-26')))


user_id cancelled_order order_date
      1               0 2015-01-28
      1               1 2015-01-31
      1               1 2015-02-08
      1               0 2015-02-23
      1               0 2015-03-23
      2               0 2015-01-25
      2               1 2015-01-28
      2               1 2015-02-06
      2               0 2015-02-21
      2               0 2015-03-26

I'd like to calculate

1) the number of cancelled orders that each customer is going to have in the next x days (e.g. 7, 14), excluding the current one and

2) the number of cancelled orders that each customer had in the past x days (e.g. 7, 14), excluding the current one.

The desired output would look like this:

user_id cancelled_order order_date plus14 minus14
      1               0 2015-01-28      2       0
      1               1 2015-01-31      1       0
      1               1 2015-02-08      0       1
      1               0 2015-02-23      0       0
      1               0 2015-03-23      0       0
      2               0 2015-01-25      2       0
      2               1 2015-01-28      1       0
      2               1 2015-02-06      0       1
      2               0 2015-02-21      0       0
      2               0 2015-03-26      0       0

A solution that fit the earlier, single-user version of this problem perfectly was presented by @joel.wilson using data.table:

library(data.table)
vec <- c(14, 30) # Specify desired ranges
setDT(df)[, paste0("x", vec) := 
        lapply(vec, function(i) sum(df$cancelled_order[between(df$order_date, 
                                                 order_date, 
                                                 order_date + i, # this part can be changed to reflect the past date ranges
                                                 incbounds = FALSE)])),
        by = order_date]

However, it does not take grouping by user_id into account. When I tried to add this grouping with by = c("user_id", "order_date") or by = list(user_id, order_date), it did not work. This seems like something very basic. Any hints on how to get around this detail?

Also, keep in mind that I'm after a solution that works, even if it is not based on the above code or data.table at all!

Thanks!

Solution

Here's one way:

library(data.table)
orderDT = with(df, data.table(id = user_id, completed = !cancelled_order, d = order_date))

vec = list(minus = 14L, plus = 14L)
orderDT[, c("dplus", "dminus") := .(
    orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N
    ,
    orderDT[!(completed)][orderDT[, .(id, d_minus = d - vec$minus, d_yest = d - 1L)], on=.(id, d >= d_minus, d <= d_yest), .N, by=.EACHI]$N
)]


    id completed          d dplus dminus
 1:  1      TRUE 2015-01-28     2      0
 2:  1     FALSE 2015-01-31     1      0
 3:  1     FALSE 2015-02-08     0      1
 4:  1      TRUE 2015-02-23     0      0
 5:  1      TRUE 2015-03-23     0      0
 6:  2      TRUE 2015-01-25     2      0
 7:  2     FALSE 2015-01-28     1      0
 8:  2     FALSE 2015-02-06     0      1
 9:  2      TRUE 2015-02-21     0      0
10:  2      TRUE 2015-03-26     0      0

(I found OP's column names cumbersome and so shortened them.)
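
The question mentions several window lengths (e.g. 7 and 14 days). As a rough sketch, the same join can be wrapped in a loop over window lengths. The names `windows`, `cancelled`, `fwd`, `bwd` and the `plusN`/`minusN` column names are my own choices for illustration, and the table is rebuilt here only so that the snippet runs by itself:

```r
library(data.table)

# Same data as in the question, using the shortened column names.
orderDT <- data.table(
  id = rep(1:2, each = 5),
  completed = rep(c(TRUE, FALSE, FALSE, TRUE, TRUE), 2),
  d = as.Date(c("2015-01-28", "2015-01-31", "2015-02-08", "2015-02-23", "2015-03-23",
                "2015-01-25", "2015-01-28", "2015-02-06", "2015-02-21", "2015-03-26"))
)

windows <- c(7L, 14L)               # assumed window lengths, in days
cancelled <- orderDT[!(completed)]  # only cancelled orders are counted

for (w in windows) {
  # One row per order, holding the bounds of its forward / backward window.
  fwd <- orderDT[, .(id, lo = d + 1L, hi = d + w)]
  bwd <- orderDT[, .(id, lo = d - w, hi = d - 1L)]

  # Non-equi join: count cancelled orders of the same id inside each window.
  n_fwd <- cancelled[fwd, on = .(id, d >= lo, d <= hi), .N, by = .EACHI]$N
  n_bwd <- cancelled[bwd, on = .(id, d >= lo, d <= hi), .N, by = .EACHI]$N

  orderDT[, (paste0("plus", w)) := n_fwd]
  orderDT[, (paste0("minus", w)) := n_bwd]
}
```

Both window ends are inclusive here, as in the code above, so an order exactly 14 days away is counted in the 14-day window.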


How it works

Each of the columns can be run on its own, like

orderDT[!(completed)][orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], on=.(id, d <= d_plus, d >= d_tom), .N, by=.EACHI]$N

And this can be broken down into steps by simplifying:

orderDT[!(completed)][
  orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], 
  on=.(id, d <= d_plus, d >= d_tom), 
  .N, 
  by=.EACHI]$N
# original version

orderDT[!(completed)][
  orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], 
  on=.(id, d <= d_plus, d >= d_tom), 
  .N, 
  by=.EACHI] 
# don't extract the N column of counts

orderDT[!(completed)][
  orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)], 
  on=.(id, d <= d_plus, d >= d_tom)]
# don't create the N column of counts

orderDT[!(completed)]
# don't do the join

orderDT[, .(id, d_plus = d + vec$plus, d_tom = d + 1L)]
# see the second table used in the join

This uses a "non-equi" join, taking inequalities to define the date ranges. For more details, see the documentation page found by typing ?data.table.
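
To make clear what the two inequalities select, here is a slow but explicit base R cross-check that needs no packages. `count_cancelled()` is a hypothetical helper written for this illustration, not part of data.table; it is only a sketch that assumes the window is inclusive at both ends, as in the join above:

```r
# Rebuild the example data from the question.
df <- data.frame(
  user_id = rep(1:2, each = 5),
  cancelled_order = rep(c(0, 1, 1, 0, 0), 2),
  order_date = as.Date(c("2015-01-28", "2015-01-31", "2015-02-08", "2015-02-23", "2015-03-23",
                         "2015-01-25", "2015-01-28", "2015-02-06", "2015-02-21", "2015-03-26"))
)

# Hypothetical helper: for each row k, count the cancelled orders of the same
# user whose date lies in [order_date[k] + lo, order_date[k] + hi].
count_cancelled <- function(data, lo, hi) {
  vapply(seq_len(nrow(data)), function(k) {
    same_user <- data$user_id == data$user_id[k]
    in_window <- data$order_date >= data$order_date[k] + lo &
      data$order_date <= data$order_date[k] + hi
    sum(data$cancelled_order[same_user & in_window])
  }, numeric(1))
}

# lo = 1 and hi = -1 exclude the current order itself.
df$plus14  <- count_cancelled(df, lo = 1,   hi = 14)
df$minus14 <- count_cancelled(df, lo = -14, hi = -1)
```

This loops over every row and scans the whole table each time, so it is only meant for checking results on small data; the join does the same selection without the explicit loop.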
