In R: how to sum a variable by group between two dates
Question
I have two data frames (DF1 and DF2):
(1) DF1 contains individual-level information: 10,000 individuals nested in 30 units across 11 years (2000-2011). It contains four variables:
- "individual" (numeric ID for each individual; ranging from 1 to 10,000)
- "unit" (numeric ID for each unit; ranging from 1 to 30)
- "date1" (a date in Date format, e.g. 2000-01-01; ranging from 2000-01-01 to 2010-12-31)
- "date2" ("date1" + 1 year)
(2) DF2 contains unit-level information on the same 30 units as in DF1, across the same time period (2000-2011), and additionally contains a numeric variable ("x"):
- "unit" (numeric ID for each unit; ranging from 1 to 30)
- "date" (a date in Date format, e.g. 2000-01-01; ranging from 2000-01-01 to 2011-12-31)
- "x" (a numeric variable, ranging from 0 to 200)
I would like to create a new variable ("newvar") that gives, for each "individual" per "unit", the sum of "x" (DF2) over the period from "date1" to "date2" (both in DF1). In other words, I would like to add this new variable to DF1.
For instance, if "individual"=1 in "unit"=1 has "date1"=2000-01-01 and "date2"=2001-01-01, and in DF2 "unit"=1 has three observations in the period from "date1" to "date2" (i.e. 2000-01-01 to 2001-01-01) with "x"=1, "x"=2 and "x"=3, then I would like to add a new variable that gives "newvar"=6 for "individual"=1 in "unit"=1.
I assume that I need to use a for loop in R and have been using the following code:
for(i in length(DF1)){
DF1$newvar[i] <-sum(DF2$x[which(DF1$date == DF1$date1[i] &
DF1$date == DF1P$date1[i] &
DF2$unit == DF1P$unit[i]),])
}
But I get this error message:
Error in DF2$x[which(DF2$date == : incorrect number of dimensions
Any ideas on how to create this variable would be tremendously appreciated!
Here is a small example as well as the expected output, using one unit for the sake of simplicity:
Suppose DF1 looks like this:
individual unit date1 date2
1 1 2000-01-01 2001-01-01
2 1 2000-02-02 2001-02-02
3 1 2000-03-03 2001-03-03
4 1 2000-04-04 2001-04-04
5 1 2000-12-31 2001-12-31
(...)
996 1 2010-01-01 2011-01-01
997 1 2010-02-15 2011-02-15
998 1 2010-03-05 2011-03-05
999 1 2010-04-10 2011-04-10
1000 1 2010-12-27 2011-12-27
1001 2 2000-01-01 2001-01-01
1002 2 2000-02-02 2001-02-02
1003 2 2000-03-03 2001-03-03
1004 2 2000-04-04 2001-04-04
1005 2 2000-12-31 2001-12-31
(...)
1996 2 2010-01-01 2011-01-01
1997 2 2010-02-15 2011-02-15
1998 2 2010-03-05 2011-03-05
1999 2 2010-04-10 2011-04-10
2000 2 2010-12-27 2011-12-27
(...)
3000 34 2000-02-02 2001-02-02
3001 34 2000-05-05 2001-05-05
3002 34 2000-06-06 2001-06-06
3003 34 2000-07-07 2001-07-07
3004 34 2000-11-11 2001-11-11
(...)
9996 34 2010-02-06 2011-02-06
9997 34 2010-05-05 2011-05-05
9998 34 2010-09-09 2011-09-09
9999 34 2010-09-25 2011-09-25
10000 34 2010-10-15 2011-10-15
Suppose DF2 looks like this:
unit date x
1 2000-01-01 1
1 2000-05-01 2
1 2000-12-01 3
1 2001-01-02 10
1 2001-07-05 20
1 2001-12-31 30
(...)
2 2010-05-05 1
2 2010-07-01 1
2 2010-08-09 1
3 (...)
This is what I would like DF1 to look like after running the code:
individual unit date1 date2 newvar
1 1 2000-01-01 2001-01-01 6
2 1 2000-02-02 2001-02-02 16
3 1 2000-03-03 2001-03-03 15
4 1 2000-04-04 2001-04-04 15
5 1 2000-12-31 2001-12-31 60
(...)
996 1 2010-01-01 2011-01-01 3
997 1 2010-02-15 2011-02-15 2
998 1 2010-03-05 2011-03-05 2
999 1 2010-04-10 2011-04-10 2
1000 1 2010-12-27 2011-12-27 0
(...)
However, I cannot simply aggregate: imagine that in DF1 each "unit" has several hundred individuals for each year between 2000 and 2011, and that DF2 has many observations for each unit across the years 2000-2011.
Recommended answer
We can use data.table:
library(data.table)
setDT(DF1)
setDT(DF2)
DF1[DF2[, .(newvar = sum(x)), .(unit, individual = cumsum(date %in% DF1$date1))],
newvar := newvar, on = .(individual, unit)]
DF1
# individual unit date1 date2 newvar
#1: 1 1 2000-01-01 2001-01-01 6
#2: 2 1 2001-01-02 2002-01-02 60
Or we can use a non-equi join:
DF1[DF2[DF1, sum(x), on = .(unit, date >= date1, date <= date2),
by = .EACHI], newvar := V1, on = .(unit, date1=date)]
DF1
# individual unit date1 date2 newvar
#1: 1 1 2000-01-01 2001-01-01 6
#2: 2 1 2001-01-02 2002-01-02 60
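As a cross-check, here is a minimal, self-contained sketch that runs a repaired version of the question's loop next to a non-equi join and compares the two. The rows are copied from the question's printed example, except for one made-up unit 2 row (x = 100) that I added only to confirm that units are kept apart; the names loop_result, dt1, dt2 and joined are mine. Two things in the original loop look like the cause of the trouble: for (i in length(DF1)) runs only once, because length() of a data frame is its number of columns, and the trailing ", ]" asks for two-dimensional indexing on DF2$x, which is a plain vector, hence "incorrect number of dimensions". The na.rm = TRUE is my own addition so that a window with no DF2 rows gives 0 rather than NA.

```r
library(data.table)

# Rows taken from the question's example (unit 1), with corrected date2 values.
DF1 <- data.frame(
  individual = c(1L, 2L, 3L, 4L, 5L, 1000L),
  unit       = rep(1L, 6),
  date1      = as.Date(c("2000-01-01", "2000-02-02", "2000-03-03",
                         "2000-04-04", "2000-12-31", "2010-12-27")),
  date2      = as.Date(c("2001-01-01", "2001-02-02", "2001-03-03",
                         "2001-04-04", "2001-12-31", "2011-12-27"))
)
DF2 <- data.frame(
  unit = c(1L, 1L, 1L, 1L, 1L, 1L, 2L),
  date = as.Date(c("2000-01-01", "2000-05-01", "2000-12-01",
                   "2001-01-02", "2001-07-05", "2001-12-31",
                   "2000-06-01")),   # last row is hypothetical (unit 2)
  x    = c(1, 2, 3, 10, 20, 30, 100)
)

# Repaired loop: iterate over rows, compare DF2$date against both bounds,
# and index the vector DF2$x with a single logical vector.
loop_result <- numeric(nrow(DF1))
for (i in seq_len(nrow(DF1))) {
  in_window <- DF2$unit == DF1$unit[i] &
    DF2$date >= DF1$date1[i] &
    DF2$date <= DF1$date2[i]
  loop_result[i] <- sum(DF2$x[in_window])
}

# Non-equi join: one result row per row of dt1, in dt1's order.
dt1 <- as.data.table(DF1)
dt2 <- as.data.table(DF2)
joined <- dt2[dt1,
              .(newvar = sum(x, na.rm = TRUE)),
              on = .(unit, date >= date1, date <= date2),
              by = .EACHI]
dt1[, newvar := joined$newvar]

print(loop_result)   # 6 15 15 15 60 0
print(dt1)
```

With only the DF2 rows printed in the question, individual 2 comes out as 15 rather than the 16 shown in the expected output; the elided rows may account for the difference. The loop is fine for a few thousand rows, while the join should scale better when DF1 and DF2 are large.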