带有时间窗口的非常规观测数据表累积统计 [英] data.table cumulative stats of irregular observations with time window

查看:126
本文介绍了带有时间窗口的非常规观测数据表累积统计的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些交易记录,例如:

I have some transactional records, like the following:

library(data.table)
customers      <- 1:75
purchase_dates <- seq( as.Date('2016-01-01'),
                       as.Date('2018-12-31'), 
                       by=1 )
n <- 500L

set.seed(1)

# Assume the data are already ordered and 1 row per cust_id/purch_dt
df <- data.table( cust_id   = sample(customers, n, replace=TRUE),
                  purch_dt  = sample(purchase_dates, n, replace=TRUE),
                  purch_amt = sample(500:50000, n, replace=TRUE)/100
                  )[, .(purch_amt = sum(purch_amt)), 
                      keyby=.(cust_id, purch_dt) ]
df
# cust_id   purch_dt purch_amt
#       1 2016-03-20     69.65
#       1 2016-05-17    413.60
#       1 2016-12-25    357.18
#       1 2017-03-20    256.21
#       2 2016-05-26     49.14
#       2 2018-05-31    261.87
#       2 2018-12-27    293.28
#       3 2016-12-10    204.12
#       3 2018-09-21      8.70

我想知道在365天之前的窗口内(例如,在 d-365 通过 d-1 进行日期为 d 的交易)。

I would like to know the prior transaction count and total amount, within a 365-day prior window (i.e., at d-365 through d-1 for a transaction on date d).

我曾想过使用滚动连接,但是最多可以匹配一次以前的购买,并且可以进行多次购买。

I thought of using a rolling join, but that would match to at most one prior purchase, and there could be multiple purchases.

我能够使用带有日期过滤器的笛卡尔自连接来获得所需的结果(请参见下面的答案),但这不是一种非常节省内存的方法。

I was able to get the desired result using a Cartesian self-join with a date filter (see answer below), but that's not a very memory-efficient approach.

所需的输出:

 cust_id   purch_dt prior_purch_cnt prior_purch_amt purch_amt
       1 2016-03-20               0            0.00     69.65
       1 2016-05-17               1           69.65    413.60
       1 2016-12-25               2          483.25    357.18
       1 2017-03-20               3          840.43    256.21
       2 2016-05-26               0            0.00     49.14
       2 2018-05-31               0            0.00    261.87
       2 2018-12-27               1          261.87    293.28
       3 2016-12-10               0            0.00    204.12
       3 2018-09-21               0            0.00      8.70


推荐答案


我想知道在365天之前的窗口内(例如,在 d-365中)的先前交易计数和总金额 d-1 d )。

我认为惯用的方式是:

df[, c("ppn", "ppa") := 
  df[.(cust_id = cust_id, d_dn = purch_dt-365, d_up = purch_dt), 
    on=.(cust_id, purch_dt >= d_dn, purch_dt < d_up), 
    .(.N, sum(purch_amt, na.rm=TRUE))
  , by=.EACHI][, .(N, V2)]
]

     cust_id   purch_dt purch_amt ppn    ppa
  1:       1 2016-03-20     69.65   0   0.00
  2:       1 2016-05-17    413.60   1  69.65
  3:       1 2016-12-25    357.18   2 483.25
  4:       1 2017-03-20    256.21   3 840.43
  5:       2 2016-05-26     49.14   0   0.00
 ---                                        
494:      75 2018-01-12    381.24   2 201.04
495:      75 2018-04-01     65.83   3 582.28
496:      75 2018-06-17    170.30   4 648.11
497:      75 2018-07-22     60.49   5 818.41
498:      75 2018-10-10     66.12   4 677.86

这是非等额参加。

这篇关于带有时间窗口的非常规观测数据表累积统计的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆