R - Using data.table to efficiently test rolling conditions across multiple rows and columns
I am trying to test a variety of conditions in a data.table that looks like this reproducible example:
library(data.table)
set.seed(17)
year <- 1980 + rnbinom(10000,3,0.35)
event <- rep(LETTERS, length.out=10000)
z <- as.integer(runif(10000,min = 0, max = 10))
dt <- data.table(event,year,z)
setkey(dt, event,year)
dt <- dt[,sum(z), by=c("event","year")]
V1 (which emerges from the last command) represents a count of event occurrences.
So the data.table is an ordered array, and I need to execute a variety of functions on it. Here are some examples:
1. How do I calculate a rolling sum (or rolling mean) of the occurrences in the 10 prior years for each event? So for A 1990 the desired output is 1,452 (between 1980 and 1989). For H 2012, the output is 11 because between 2002 and 2011 there are only 11 occurrences (3 in 2002, 3 in 2007, and 5 in 2010). For A 1983 the output is NA, since the window would reach back before 1980.
2. How can I check whether an event occurs in at least 12 out of 15 prior years? For A 1997 we can see that the event occurred in more than 12 of the 15 prior years (1982 - 1996; it happened in every year besides 1996), so the criterion is met. However, for A 2001 the event only occurs in 11 of the 15 prior years (1986 - 2000; it does not happen in 1996, 1998, 1999, and 2000), so the criterion is not met. The desired output here would be a discrete 1 (criterion met) or 0 (criterion not met).
Ideally the code would enable the calculation of both 1 and 2 not only for years that occur in the data.table but also for those between 1980 and 2013 that are missing. So for K 2005, we can calculate the outcome for Q1 as 25 (13 + 5 + 3 + 3 + 2) (thanks @Arun for pointing out the former error). For Q2, we see the event does not occur in 1999, 2000, 2001, 2003, and 2004, hence the criterion "at least in 12 out of 15 years" is not met. Also, it is possible that the event-year combination exists in the data.table but that V1 has value 0 (see row 18, A 2001). Ideally, such zero occurrences would be treated as non-occurrences (e.g. by deleting all rows for which V1 is zero).
I know it's uncommon to post two questions but I feel they belong together and really relate to similar problems. Hope someone can make some suggestions.
Thanks a lot,
Simon
For your first question:
This will get the running sum for years that are not necessarily in the dataset as well (as you requested just underneath the two points). The idea is to first generate all combinations of event and year, even the ones that don't exist in the dataset. This can be accomplished with the function CJ (for cross join), which creates every year for each event.
setkey(dt, event, year)
d1 = CJ(event=unique(dt$event), year=min(dt$year):max(dt$year))
Now, we join back with dt to fill the missing values for V1 with NA.
d1 = dt[d1]
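To see what the cross join and the join back do, here is a minimal sketch on a tiny made-up table. The names toy, grid and full are hypothetical and not part of the question's data, and it uses the on= argument rather than keys, which assumes a reasonably recent data.table.

```r
library(data.table)

# Toy data (hypothetical): event B has no row for 1980 or 1982,
# event A has no row for 1981.
toy <- data.table(event = c("A", "A", "B"),
                  year  = c(1980L, 1982L, 1981L),
                  V1    = c(5L, 2L, 7L))

# Every event paired with every year in the observed range.
grid <- CJ(event = unique(toy$event), year = min(toy$year):max(toy$year))

# Look up each grid row in toy; grid rows without a match get V1 = NA.
full <- toy[grid, on = c("event", "year")]
print(full)
```

The result has one row per event-year pair (6 rows here), with NA in V1 wherever the original table had no row.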
Now we have a dataset with all combinations of event and year. From here, we need a way to perform the rolling sum. For this, we create yet another dataset, which contains the previous 10 years for each year, as follows:
window_size = 10L
d2 = d1[, list(window = seq(year-window_size, year-1L, by=1L)), by="event,year"]
For each "event,year", we create a new column window that holds the previous 10 years.
Now, all we have to do is set the key columns appropriately and perform a join to get the corresponding "V1" values.
setkey(d2, event, window) ## note the join here is on "event, window"
setkey(d1, event, year)
ans = d1[d2]
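The window expansion and lookup can be sketched on a small made-up table as follows. This is an illustration under assumptions, not the exact code above: the names full, win, looked_up and target_year are hypothetical, the window is shortened to 2 years, and the year column is renamed to target_year before the join so that no automatically renamed column (year.1 or i.year) is involved.

```r
library(data.table)

# Toy complete grid for one event (hypothetical); 1981 has no occurrences.
full <- data.table(event = "A", year = 1980:1984,
                   V1 = c(5L, NA, 2L, 1L, 4L))

window_size <- 2L  # shortened window for the toy example

# One row per (event, target_year, prior year in the window).
win <- full[, .(event, target_year = year)]
win <- win[, .(window = seq(target_year - window_size, target_year - 1L)),
           by = .(event, target_year)]

# For each window year, fetch V1 from full; window years that are not
# in full (here 1978 and 1979) come back with V1 = NA.
looked_up <- full[win, on = .(event, year = window)]
print(looked_up)
```

Each target year now has window_size rows, one per prior year, ready to be aggregated by event and target_year.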
Now we have the values of "V1" for each "event,window" combination. All that remains is to aggregate by "event,year.1" ("year.1" was previously "year", and "year" in ans was previously "window"; recent data.table versions name this column "i.year" instead of "year.1", so adjust the code accordingly). Here, we take care of the condition that if any of the years is < 1980, the sum should be NA. This is done with a small hack that relies on TRUE | NA being TRUE and FALSE | NA being NA.
q1 = ans[, sum(V1, na.rm=TRUE) * (!any(year < 1980) | NA), by="event,year.1"]
q1[event == "K" & year.1 == 2005]
# event year.1 V1
# 1: K 2005 25
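The masking trick can be checked in isolation with plain R. In this small sketch the helper mask and the example vectors are hypothetical names chosen for illustration.

```r
# TRUE | NA is TRUE and FALSE | NA is NA, so multiplying by the result
# either keeps the sum or turns it into NA.
mask <- function(years) !any(years < 1980L) | NA

counts <- c(3L, NA, 4L)
total  <- sum(counts, na.rm = TRUE)   # missing years are ignored in the sum

res_ok    <- total * mask(1980:1989)  # window fully inside the data: sum kept
res_early <- total * mask(1975:1984)  # window reaches before 1980: NA
print(res_ok)
print(res_early)
```

An explicit if/else on any(year < 1980) would express the same rule more readably. The multiplication is just a compact way to write it inside j.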
For your second question:
Repeat the same steps as above with window_size = 15L instead of 10L, up to the point where ans is created. Then, we can do:
q2 = ans[!is.na(V1)][, .N, by="event,year.1"]
q2[event == "A" & year.1 == 1997]
# event year.1 N
# 1: A 1997 14
This is correct because dt has all years from 1982 to 1995, while 1996 is missing and therefore not counted, so N = 14, as it should be.
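The question also asks for a 0/1 flag and for zero counts to be treated as non-occurrences, which the count above does not yet do. One possible way to get there is sketched below on made-up data: ans_toy, target_year, n_years, met and min_years are hypothetical names, and the threshold is lowered to 3 to fit the toy. Counting inside j instead of filtering rows first also keeps groups that have no occurrences at all, which would otherwise be dropped.

```r
library(data.table)

# Hypothetical looked-up table: one row per (event, target year, window year).
ans_toy <- data.table(
  event       = "A",
  target_year = rep(c(1997L, 2001L), each = 4L),
  V1          = c(3L, 1L, NA, 2L,   0L, NA, 4L, NA)
)

min_years <- 3L  # assumed threshold for the toy; the question uses 12 of 15

# A year counts only if V1 is present and greater than zero.
q2_toy <- ans_toy[, .(n_years = sum(!is.na(V1) & V1 > 0L)),
                  by = .(event, target_year)]

# 1 if the criterion is met, 0 otherwise.
q2_toy[, met := as.integer(n_years >= min_years)]
print(q2_toy)
```

Here 1997 has three qualifying years and gets met = 1, while 2001 has only one (the zero count and the NAs are ignored) and gets met = 0.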