How to calculate a rolling statistic in R using data.table on unevenly spaced data


Problem description

I have a data set indexed by two ID variables (one nested in the other) and a date, and I want to calculate a rolling statistic on this data.

My real dataset is large (~200 million rows), and I have enjoyed the speed gains from data.table on other tasks, but I cannot figure out how to use data.table optimally (i.e. leverage binary search and avoid a vector scan) for this problem.

Sample data:

library(data.table)

set.seed(3)
# shorter vectors are recycled to 248 rows (2 id1 values x 2 id2 values x 62 days)
dt1 <- 
 data.table(id1=c(rep("a",124),rep("b",124)),
            id2=c(rep("x",62),rep("y",62)),
            date=seq(as.Date("2012-05-01"),as.Date("2012-07-01"),"days"),
            var1=rpois(124,14),
            var2=rpois(124,3))
setkey(dt1,id1,id2,date)
# create uneven time spacing
dt1 <- dt1[-c(5,10,36,46,58)]

My end goal is to calculate a "rolling statistic" for each day within id1/id2, which is:

sum(var2)/sum(var1)

computed over all other rows with the same id1/id2 combination that fall within the 30 days prior to that row.
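To make the window definition concrete, here is a minimal brute-force sketch on made-up data. The table `toy` and the helper `window_stat` are hypothetical names introduced only for illustration, and the ratio is written as sum(var1)/sum(var2), the form used by the code in this post. It is a reference for checking results, not a fast solution.

```r
library(data.table)

# Hypothetical toy data: one id pair, unevenly spaced dates.
toy <- data.table(
  id1  = "a",
  id2  = "x",
  date = as.Date("2012-05-01") + c(0, 1, 3, 40),
  var1 = c(10, 20, 30, 40),
  var2 = c(1, 2, 3, 4)
)

# Hypothetical helper: ratio of sums over rows of the same id pair dated
# from d - 30 through d - 1 (the row at d itself is excluded).
window_stat <- function(dt, i1, i2, d) {
  w <- dt[id1 == i1 & id2 == i2 & date >= d - 30 & date <= d - 1]
  if (nrow(w) == 0L) return(NA_real_)  # no earlier rows inside the window
  sum(w$var1) / sum(w$var2)
}

# Rows dated 2012-05-01 and 2012-05-02 fall in the window: (10 + 20) / (1 + 2)
window_stat(toy, "a", "x", as.Date("2012-05-04"))  # 10
```

Calling this helper once per row is a vector scan per row, which is exactly the cost to avoid on a large table.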

I am not sure this is a good first step, but for clarity's sake here is the code that gets my desired result for all IDs on date = 2012-06-12:

dt1[date < as.Date("2012-06-12") & date > as.Date("2012-06-12")-31,
    list("newstat"=sum(var1)/sum(var2),
         "date"=as.Date("2012-06-12")),by=list(id1,id2)]

   id1 id2 newstat       date
1:   a   x   3.925 2012-06-12
2:   a   y   4.396 2012-06-12
3:   b   x   3.925 2012-06-12
4:   b   y   4.396 2012-06-12

I thought of doing a Cartesian self join on id1 and id2 and then reducing the resulting data.table to the appropriate date range (I don't know how to do that in one step). This gives me the desired result, but I don't know how to do it without using a logical subset as the argument to i in the reduction step, which is unacceptably slow. I think I am failing to leverage the data.table's keys at that point, but I am not sure how to solve this.

Example:

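dt1[setkey(dt1[,list(id1,id2,"date_grp"=date)],id1,id2),
    # keep id1/id2 in j so they are available for the grouping below
    list(id1,id2,date_grp,date,var1,var2),
    # recent data.table versions refuse this many-to-many join without it
    allow.cartesian=TRUE][
      # Here comes slow subset (a vector scan over the joined rows);
      # as written it keeps date_grp-29 through date_grp-1
      date<date_grp & date > date_grp-30,
      list("newstat"=sum(var1)/sum(var2)),
      by=list(id1,id2,date_grp)]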

Result:

     id1 id2   date_grp newstat
  1:   a   x 2012-05-02  0.4286
  2:   a   x 2012-05-03  0.4000
  3:   a   x 2012-05-04  0.2857
  4:   a   x 2012-05-06  0.2903
  5:   a   x 2012-05-07  0.3056
 ---                           
235:   b   y 2012-06-27  0.2469
236:   b   y 2012-06-28  0.2354
237:   b   y 2012-06-29  0.2323
238:   b   y 2012-06-30  0.2426
239:   b   y 2012-07-01  0.2304


Recommended answer

I'm sure there is a better way to do this, but one thing you can do is avoid the full Cartesian join, which is what's killing you, by generating an interim table with the join keys:

dt.dates <- dt1[, list(date.join=seq(as.Date(date - 1, origin="1970-01-01"),
                                     by="-1 day", length.out=30)),
                by=list(date, id1, id2)]

For each date-id group, we have now generated the list of allowable join dates. Now we join back to the data and compute our metric.

setkey(dt.dates, date.join, id1, id2)
setkey(dt1,date,id1,id2)
dt.dates[dt1][ , sum(var1)/sum(var2), by=list(id1, id2, date)]
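The same interim-table join can be traced on a few rows of made-up data, where the expected values are easy to check by hand. The names `toy`, `toy.dates`, `joined` and `res` are hypothetical, and `nomatch = 0L` is an added assumption that drops data rows not covered by any window (without it they show up as a group with an NA date).

```r
library(data.table)

# Hypothetical toy data: one id pair, unevenly spaced dates.
toy <- data.table(
  id1  = "a",
  id2  = "x",
  date = as.Date("2012-05-01") + c(0, 1, 3, 40),
  var1 = c(10, 20, 30, 40),
  var2 = c(1, 2, 3, 4)
)

# Interim table: for each target date, the 30 calendar days before it.
toy.dates <- toy[, list(date.join = seq(as.Date(date - 1, origin = "1970-01-01"),
                                        by = "-1 day", length.out = 30)),
                 by = list(date, id1, id2)]

# Keyed join: each data row attaches to every target date whose window covers it.
setkey(toy.dates, date.join, id1, id2)
setkey(toy, date, id1, id2)
joined <- toy.dates[toy, nomatch = 0L]

# Aggregate per target date; here `date` is the target date from toy.dates.
res <- joined[, list(newstat = sum(var1) / sum(var2)), by = list(id1, id2, date)]
res
```

Only the targets 2012-05-02 and 2012-05-04 have earlier rows inside their windows, so `res` has two rows. The interim table holds 30 rows per data row, which is the memory cost of replacing the vector scan with a keyed join.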

I couldn't replicate your result for 6/12, but I think we have a seeding issue. Compare:

> dt.dates[dt1][ , sum(var1)/sum(var2), by=list(id1, id2, date)][date=="2012-06-12"]
   id1 id2       date       V1
1:   a   x 2012-06-12 3.630631
2:   a   y 2012-06-12 4.434783
3:   b   x 2012-06-12 3.634783
4:   b   y 2012-06-12 4.434783
> dt1[date < as.Date("2012-06-12") & date > as.Date("2012-06-12")-31, list("newstat"=sum(var1)/sum(var2), "date"=as.Date("2012-06-12")),by=list(id1,id2)]
   id1 id2  newstat       date
1:   a   x 3.630631 2012-06-12
2:   a   y 4.434783 2012-06-12
3:   b   x 3.634783 2012-06-12
4:   b   y 4.434783 2012-06-12

Essentially the same.
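As a possible alternative that skips the interim table, here is a sketch using a non-equi join. This is a different technique from the one above and assumes a data.table version that supports inequality conditions in `on=` (1.9.8 or later, as far as I know). The names `toy`, `bounds`, `lo`, `hi`, `target` and `out` are hypothetical.

```r
library(data.table)

# Hypothetical toy data: one id pair, unevenly spaced dates.
toy <- data.table(
  id1  = "a",
  id2  = "x",
  date = as.Date("2012-05-01") + c(0, 1, 3, 40),
  var1 = c(10, 20, 30, 40),
  var2 = c(1, 2, 3, 4)
)

# One row per target date with the window bounds [date - 30, date - 1].
bounds <- toy[, list(id1, id2, target = date, lo = date - 30, hi = date - 1)]

# Non-equi join: for each row of bounds, aggregate the matching rows of toy.
# by = .EACHI yields one result row per row of bounds, in the same order;
# targets with no rows in their window get NA.
res <- toy[bounds,
           on = .(id1, id2, date >= lo, date <= hi),
           list(newstat = sum(var1) / sum(var2)),
           by = .EACHI]

# Attach the statistic to the target dates.
out <- bounds[, list(id1, id2, date = target)]
out[, newstat := res$newstat]
out
```

On the toy data the targets 2012-05-02 and 2012-05-04 both get 10, and the other two targets get NA because no earlier row falls inside their windows. Whether this is faster than the interim-table join on ~200 million rows would need to be measured.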
