R - Using data.table to efficiently test rolling conditions across multiple rows and columns


Problem description

I am trying to test a variety of conditions in a data.table that looks like this reproducible example:

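 library(data.table)

 set.seed(17)
 year <- 1980 + rnbinom(10000,3,0.35)            # event years, 1980 onwards
 event <- rep(LETTERS, length.out=10000)         # event labels A..Z
 z <- as.integer(runif(10000,min = 0, max = 10)) # occurrences per record
 dt <- data.table(event,year,z)
 setkey(dt, event,year)
 dt <- dt[,sum(z), by=c("event","year")]         # the unnamed sum becomes V1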

V1 (which emerges from the last command) represents a count of event occurrences.

So the data table is an ordered array and I need to execute a variety of functions on it. Here are some examples:

  1. How do I calculate a rolling sum (or rolling mean) of the occurrences in the 10 prior years for each event? So for A 1990 the desired output is 1,452 (between 1980 and 1989). For H 2012, the output is 11 because between 2002 and 2011 there are only 11 occurrences (3 in 2002, 3 in 2007, and 5 in 2010). For A 1983 the output is NA.

  2. How can I check whether an event occurs in at least 12 out of 15 prior years? So for A 1997 we can see that the event occurred in more than 12 of the 15 prior years (1982 - 1996; it happened in every year except 1996), so the criterion is met. However, for A 2001 the event occurs in only 11 of the 15 prior years (1986 - 2000; it does not happen in 1996, 1998, 1999, and 2000), so the criterion is not met. The desired output here would be a discrete 1 (criterion met) or 0 (criterion not met).

Ideally the code would enable the calculation of both 1 and 2 not only for years that occur in the data.table but also for those between 1980 and 2013 that are missing. So for K 2005, we can calculate the outcome for Q1 as 25 (13 + 5 + 3 + 3 + 2) (thanks @Arun for pointing out the earlier error). For Q2, we see the event does not occur in 1999, 2000, 2001, 2003, and 2004, hence the criterion "at least 12 out of 15 years" is not met. Also, it is possible that an event-year combination exists in the data.table but V1 has value 0 (see row 18, A 2001). Ideally, such zero occurrences would be treated as non-occurrences (e.g. by deleting all rows for which V1 is zero).

I know it's uncommon to post two questions but I feel they belong together and really relate to similar problems. Hope someone can make some suggestions.

Thanks a lot,

Simon

Solution

For your first question:

This will also give the running sum for years that are not in the dataset (as you requested just underneath the two points). The idea is to first generate all combinations of event and year, even the ones that don't exist in the dataset. This can be accomplished with the function CJ (for cross join), which creates, for each event, every year in the observed range.

setkey(dt, event, year)
d1 = CJ(event=unique(dt$event), year=min(dt$year):max(dt$year))
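To make the cross join concrete, here is a minimal sketch on a made-up three-row table. The names `toy` and `grid` are only for illustration and are not part of the original answer.

```r
library(data.table)

# Toy data (hypothetical): "A" is observed in 2000 and 2002, "B" only in 2001.
toy <- data.table(event = c("A", "A", "B"),
                  year  = c(2000L, 2002L, 2001L),
                  V1    = c(3L, 5L, 2L))

# Pair every event with every year in the observed range, present or not.
grid <- CJ(event = unique(toy$event), year = min(toy$year):max(toy$year))

nrow(grid)  # 2 events x 3 years = 6 rows
```

CJ returns its result sorted and keyed by all of its columns, which is what lets the next join run without a separate setkey on the grid.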

Now, we join back with dt to fill the missing values for V1 with NA.

d1 = dt[d1]
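As a small illustration of what this join does, the sketch below repeats it on a made-up table (`toy`, `grid` and `filled` are hypothetical names). Combinations that are absent from the data come back with V1 set to NA.

```r
library(data.table)

toy <- data.table(event = c("A", "A", "B"),
                  year  = c(2000L, 2002L, 2001L),
                  V1    = c(3L, 5L, 2L))
setkey(toy, event, year)

grid <- CJ(event = unique(toy$event), year = 2000:2002)

# One row per row of grid; unmatched (event, year) pairs get V1 = NA.
filled <- toy[grid]
filled$V1
```

If zero counts should be treated as non-occurrences, as the question asks, one option is to drop them before the join (for example `toy[V1 > 0]`), so that they also come back as NA.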

Now we have a dataset with all combinations of event and year. From here, we need a way to perform the rolling sum. For this, we create yet another dataset that contains, for each year, all of the previous 10 years, as follows:

window_size = 10L
d2 = d1[, list(window = seq(year-window_size, year-1L, by=1L)), by="event,year"]

For each "event,year" group, this creates a new column window that holds the previous 10 years, one row per year in the window.
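The window expansion is easier to see with a shorter window. A minimal sketch, assuming a window of 3 years and a made-up table `years`:

```r
library(data.table)

window_size <- 3L   # the answer uses 10L; 3L keeps the output small
years <- data.table(event = "A", year = 2003:2004)

# Each (event, year) row expands to window_size rows, one per prior year.
win <- years[, list(window = seq(year - window_size, year - 1L, by = 1L)),
             by = "event,year"]
win
```

Here the row for 2003 expands to 2000, 2001, 2002 and the row for 2004 to 2001, 2002, 2003. The target year itself is never part of its own window.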

Now, all we have to do is set the key columns appropriately and perform a join to get the corresponding "V1" values.

setkey(d2, event, window) ## note the join here is on "event, window"
setkey(d1, event, year)

ans = d1[d2]
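The name given to the clashing year column has, as far as I know, changed across data.table versions (the output in this answer shows year.1; more recent releases appear to use an i. prefix instead). A sketch that sidesteps the question is to give the target year its own name and join with an explicit on=, which needs a data.table version that supports that argument. The names `counts`, `win` and `target_year` below are assumptions for illustration only.

```r
library(data.table)

# Counts per (event, year); NA marks a year with no record.
counts <- data.table(event = "A", year = 2000:2002, V1 = c(3L, NA, 5L))

# Window rows for a single target year, with the target kept under its own name.
win <- data.table(event = "A", target_year = 2003L, year = 2000:2002)

# Look up V1 for every window year; no column names clash.
ans <- counts[win, on = c("event", "year")]

q1 <- ans[, list(V1 = sum(V1, na.rm = TRUE)), by = list(event, target_year)]
q1
```

With these toy numbers the window for 2003 sums to 3 + 5 = 8.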

Now we have the value of "V1" for each "event,window" combination. All that is left is to aggregate by "event,year.1" ("year.1" was previously "year", and "year" in ans was previously "window"). Here we also handle the condition that if any year in the window is < 1980, the sum should be NA. This is done with a small hack: TRUE | NA = TRUE and FALSE | NA = NA, so multiplying the sum by (!any(year < 1980) | NA) leaves it unchanged when the whole window lies at or after 1980 and turns it into NA otherwise.

q1 = ans[, sum(V1, na.rm=TRUE) * (!any(year < 1980) | NA), by="event,year.1"]

q1[event == "K" & year.1 == 2005]
#    event year.1 V1
# 1:     K   2005 25
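The NA trick used in the aggregation can be checked in isolation with base R. The helper `window_sum` below is a hypothetical wrapper written for this illustration, not something from the answer.

```r
# Three-valued logic: OR with NA is TRUE only when the other side is TRUE.
TRUE | NA    # TRUE
FALSE | NA   # NA

# Sum a window, but return NA if the window reaches before first_year.
window_sum <- function(values, years, first_year = 1980) {
  sum(values, na.rm = TRUE) * (!any(years < first_year) | NA)
}

window_sum(c(1, NA, 4), 1985:1987)  # 5: window fully inside the data range
window_sum(c(1, NA, 4), 1978:1980)  # NA: window starts before 1980
```

Multiplying by TRUE is the same as multiplying by 1, so a valid window keeps its sum, while an NA multiplier makes the whole result NA.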


For your second question:

Repeat the same steps as above with window_size = 15L instead of 10L, up to and including the construction of ans. Then, we can do:

q2 = ans[!is.na(V1)][, .N, by="event,year.1"]

q2[event == "A" & year.1 == 1997]
#    event year.1  N
# 1:     A   1997 14

This is correct because dt has every year from 1982 to 1995, while 1996 is missing and therefore not counted, so N = 14, as it should be.
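The question asks for a 0/1 indicator and for zero counts to be ignored, which the count above does not yet provide. One possible way to finish the step is sketched below on toy data with a 3-year window and a threshold of 2 (the question would use 15L and 12L). All names here (`toy`, `targets`, `win`, `hits`, `n_years`, `met`, `min_years_hit`) are assumptions for illustration. The sketch counts with sum(!is.na(V1)) rather than filtering first, so target years with no hits at all stay in the result with a count of 0.

```r
library(data.table)

# Toy counts; the zero in 2001 should not count as an occurrence.
toy <- data.table(event = "A",
                  year  = 2000:2003,
                  V1    = c(4L, 0L, 2L, 1L))

window_size   <- 3L   # the question uses 15L
min_years_hit <- 2L   # the question uses 12L

# One row per (event, target_year, window year).
targets <- toy[, list(event, target_year = year)]
win <- targets[, list(year = seq(target_year - window_size,
                                 target_year - 1L, by = 1L)),
               by = list(event, target_year)]

# Drop zero counts, then look up each window year (NA = no occurrence).
hits <- toy[V1 > 0][win, on = c("event", "year")]

# Years with an occurrence per target year, then the 0/1 criterion.
q2 <- hits[, list(n_years = sum(!is.na(V1))), by = list(event, target_year)]
q2[, met := as.integer(n_years >= min_years_hit)]
q2
```

Only target year 2003 meets the toy criterion, because 2000 and 2002 fall in its window and 2001 is excluded as a zero. Whether windows that reach before the first year of data should return NA instead of 0, as in the first question, is left open here.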
