根据日期对数据表中的前行进行计数 [英] Counting previous rows in a data table based on date

查看:96
本文介绍了根据日期对数据表中的前行进行计数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

(很抱歉,如果这里的某些术语不正确-我来自SQL背景,我只是进入R世界)

(Apologies if some of the terminology here is off - I come from a SQL background and I am only just getting into the R world)

我有数据带有一系列按日期排序的条目的表。数据表中的字段之一是分组值,一个是时间值。
按组对数据进行排序(或键入键-我是R的新手,仍然不确定该如何区分)然后,我想对每一行的日期进行计数,该组中的行中有多少行在当前行之前

I have a data table with a series of date-ordered entries. One of the fields in the data table is a grouping value, and one is a time value. With the data ordered (or keyed - I'm new to R and still not sure of the difference) by the group THEN the date, I want to count, for each row HOW MANY rows in this group precede the current row (including the current), within a given timespan.

这是我尝试使用Loblolly数据集进行的简化示例:

Here's a simplified example of what I'm trying to do, using the Loblolly data set:

准备示例数据:

library(lubridate)
library(zoo)
library(data.table)
DT = as.data.table(Loblolly)
DT[,rd := Sys.time() + years(age)]
setkey(DT,Seed,rd)

现在我们有一个由Seed排序的数据表(该组)和rd(我的日期列)。我有一个解决方案,它将根据10年的间隔产生我的计数值(ct):

Now we have a data table ordered by Seed (the group) and rd (my date column). I have a solution which will produce my count value (ct) based on an interval of 10 years:

DT[,.ct:=mapply(function(x,y) DT[(rd>x-years(10) & rd<=x &Seed==y),.N],DT$rd,DT$Seed)]

这将在此示例数据集中产生所需的结果:

This produces the desired result in this example dataset:

    height age Seed                  rd  ct
 1:   3.93   3  329 2019-03-01 13:38:00   1
 2:   9.34   5  329 2021-03-01 13:38:00   2
 3:  26.08  10  329 2026-03-01 13:38:00   3
 4:  37.79  15  329 2031-03-01 13:38:00   2
 5:  48.31  20  329 2036-03-01 13:38:00   2
 6:  56.43  25  329 2041-03-01 13:38:00   2
 7:   4.12   3  327 2019-03-01 13:38:00   1
 8:   9.92   5  327 2021-03-01 13:38:00   2
 9:  26.54  10  327 2026-03-01 13:38:00   3
10:  37.82  15  327 2031-03-01 13:38:00   2
...
...

但是,我需要扩大规模以处理超过500万条记录,大约有10,000个小组,并且在那儿花了太长的时间。

However, I need to scale this up to work on upwards of 5 million records, across approx 10,000 groups, and it takes an unfeasibly long time to run there. Is there a quicker and less clumsy method to do what I'm trying to do?

推荐答案

这里是使用<$的可能解决方案吗? c $ c> data.table :: foverlaps 。这里的想法是首先加入 {Sys.time()-years(10),Sys.time()+ years(age)} 的整个范围。然后,仅计算差异小于< = 10年的实例。

Here's a possible solution using data.table::foverlaps. The idea here is to first join over the whole range of {Sys.time() - years(10), Sys.time() + years(age)}. Then, count only the instances when the difference is less <= 10 years.

DT <- as.data.table(Loblolly)
DT[, c("rd", "rd2") := Sys.time() + years(age)] # create identical columns so foverlaps will work
setkey(DT, Seed, rd, rd2) # key by all for same reason
DT2 <- DT[, .(Seed, rd = rd - years(10), rd2, indx = .I)] # create minum range, create index to store row number
DT[, ct := foverlaps(DT, DT2)[i.rd > rd, .N, by = indx]$N] # run foverlaps, subset by condition and count
head(DT, 10)
#     height age Seed                  rd                 rd2 ct
#  1:   3.93   3  329 2019-03-01 22:59:02 2019-03-01 22:59:02  1
#  2:   9.34   5  329 2021-03-01 22:59:02 2021-03-01 22:59:02  2
#  3:  26.08  10  329 2026-03-01 22:59:02 2026-03-01 22:59:02  3
#  4:  37.79  15  329 2031-03-01 22:59:02 2031-03-01 22:59:02  2
#  5:  48.31  20  329 2036-03-01 22:59:02 2036-03-01 22:59:02  2
#  6:  56.43  25  329 2041-03-01 22:59:02 2041-03-01 22:59:02  2
#  7:   4.12   3  327 2019-03-01 22:59:02 2019-03-01 22:59:02  1
#  8:   9.92   5  327 2021-03-01 22:59:02 2021-03-01 22:59:02  2
#  9:  26.54  10  327 2026-03-01 22:59:02 2026-03-01 22:59:02  3
# 10:  37.82  15  327 2031-03-01 22:59:02 2031-03-01 22:59:02  2






编辑17/3 / 2017年:

使用data.table v1.10.4 +,现在可以使用非uqui联接加上 by = .EACHI 。基本上,您可以使用> = < = 进行联接,而不仅仅是完全联接并运行计算在加入时(为了避免像您的情况那样进行笛卡尔加入)并仅返回最终结果。因此,在您的特定情况下,您可以

Using data.table v1.10.4+ you can now use non-uqui joins combined with by = .EACHI. Which basically allows you to both join using >= and <= rather than just exact join and also run computations while joining (in order to avoid Cartesian joins like in your case) and return just the final result. So in your specific case you can just do

DT[, rd10 := rd - years(10)]
DT[, ct := DT[DT, .N, on = .(Seed, rd <= rd, rd > rd10), by = .EACHI]$N]

这篇关于根据日期对数据表中的前行进行计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆