Subset based on date in a reference table
Problem description
I have table1 as follows.
StudentId Date1 Lunch
23433 2014-08-26 Yes
233989 2014-08-18 No
909978 2014-08-06 No
777492 2014-08-11 Yes
3987387 2014-08-26 No
I have another table, table2, as follows.
Id StudentId Date2 Result_Nm
1 777492 2012.06.10 0.1
2 777492 2013.12.06 2.0
3 777492 2014.08.30 0.6
4 23433 2011.08.26 3.0
5 23433 2015.04.06 3.0
6 233989 2011.05.14 0.003
7 233989 2014.09.14 0.05
8 909978 2004-09-12 0.2
9 909978 2005-05-10 0.23
10 909978 2015-01-02 2.4
11 3987387 2014-10-06 3.5
12 3987387 2014-08-26 1.17
I want to retain only the observations from table2 where, for each StudentId, the Date2 value is on or before that student's Date1 value in table1. In other words, the result should contain these rows.
Id StudentId Date2 Result_Nm
1 777492 2012.06.10 0.1
2 777492 2013.12.06 2.0
4 23433 2011.08.26 3.0
6 233989 2011.05.14 0.003
8 909978 2004-09-12 0.2
9 909978 2005-05-10 0.23
12 3987387 2014-08-26 1.17
Observation 3 is excluded because the Date1 value for StudentId 777492 is 2014-08-11, which is earlier than its Date2 value of 2014.08.30. Observations 5, 7, 10 and 11 are excluded for the same reason. I have used subset before, but this is a little more challenging and I need help.
Recommended answer
We can convert the date columns to 'Date' class using ymd from lubridate. It accepts both separators that appear in the data (. and -). Then join the two datasets by 'StudentId' (left_join), remove the rows that fail the date condition using filter, and select the original columns of 'df2'.
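library(lubridate)
library(dplyr)
# Parse both date columns; ymd() handles the "." and "-" separators
df2$Date2 <- ymd(df2$Date2)
df1$Date1 <- ymd(df1$Date1)
# Attach Date1 to every df2 row, keep rows with Date2 <= Date1,
# then keep only the four original df2 columns
left_join(df2, df1, by = 'StudentId') %>%
  filter(Date2 <= Date1) %>%
  select(1:4)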
# Id StudentId Date2 Result_Nm
#1 1 777492 2012-06-10 0.100
#2 2 777492 2013-12-06 2.000
#3 4 23433 2011-08-26 3.000
#4 6 233989 2011-05-14 0.003
#5 8 909978 2004-09-12 0.200
#6 9 909978 2005-05-10 0.230
#7 12 3987387 2014-08-26 1.170
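Since the question mentions subset, the same join-then-filter idea can also be sketched in base R with no extra packages. This is only an illustration under stated assumptions: the data is retyped from the question, the "." separators are normalised with gsub instead of lubridate's ymd, and the names merged and base_result are made up for the example.

```r
# Example data retyped from the question (Lunch is not needed for the filter)
df1 <- data.frame(
  StudentId = c(23433L, 233989L, 909978L, 777492L, 3987387L),
  Date1 = c("2014-08-26", "2014-08-18", "2014-08-06", "2014-08-11", "2014-08-26"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Id = 1:12,
  StudentId = c(777492L, 777492L, 777492L, 23433L, 23433L, 233989L, 233989L,
                909978L, 909978L, 909978L, 3987387L, 3987387L),
  Date2 = c("2012.06.10", "2013.12.06", "2014.08.30", "2011.08.26", "2015.04.06",
            "2011.05.14", "2014.09.14", "2004-09-12", "2005-05-10", "2015-01-02",
            "2014-10-06", "2014-08-26"),
  Result_Nm = c(0.1, 2, 0.6, 3, 3, 0.003, 0.05, 0.2, 0.23, 2.4, 3.5, 1.17),
  stringsAsFactors = FALSE
)

# Replace "." with "-" so that as.Date() can parse every value
df1$Date1 <- as.Date(df1$Date1)
df2$Date2 <- as.Date(gsub(".", "-", df2$Date2, fixed = TRUE))

# Attach each student's Date1 to the df2 rows, then keep Date2 <= Date1
merged <- merge(df2, df1, by = "StudentId")
base_result <- merged[merged$Date2 <= merged$Date1, names(df2)]

# merge() sorts by StudentId, so restore the original Id order
base_result <- base_result[order(base_result$Id), ]
rownames(base_result) <- NULL
base_result
```

Because merge reorders rows by the join column, the final order(base_result$Id) step is what brings the result back in line with the ordering shown in the output above.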
Or we can use data.table. Here we convert 'df2' from a 'data.frame' to a 'data.table' (setDT), set the key to 'StudentId' (setkey(..., StudentId)), join with a subset of 'df1' (its first two columns, 'StudentId' and 'Date1'), and filter the rows with the condition .SD[Date2 <= Date1], evaluated once per matching key value (by = .EACHI). More information about .EACHI is in the description of the by argument in ?data.table.
library(data.table)
# df1[1:2] keeps the first two columns (StudentId, Date1) because df1 is still a data.frame
setkey(setDT(df2), StudentId)[df1[1:2], .SD[Date2 <= Date1], by = .EACHI][order(Id)]
# StudentId Id Date2 Result_Nm
#1: 777492 1 2012-06-10 0.100
#2: 777492 2 2013-12-06 2.000
#3: 23433 4 2011-08-26 3.000
#4: 233989 6 2011-05-14 0.003
#5: 909978 8 2004-09-12 0.200
#6: 909978 9 2005-05-10 0.230
#7: 3987387 12 2014-08-26 1.170
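As a possibly easier-to-read variant of the same data.table idea, the join can be written with the on argument instead of setting a key, followed by an ordinary row filter. This is a sketch rather than the answer's own method: dt1, dt2, joined and dt_result are illustrative names, and the data is retyped from the question with the dates already converted.

```r
library(data.table)

# Data retyped from the question, with dates converted up front
dt1 <- data.table(
  StudentId = c(23433L, 233989L, 909978L, 777492L, 3987387L),
  Date1 = as.Date(c("2014-08-26", "2014-08-18", "2014-08-06",
                    "2014-08-11", "2014-08-26"))
)
dt2 <- data.table(
  Id = 1:12,
  StudentId = c(777492L, 777492L, 777492L, 23433L, 23433L, 233989L, 233989L,
                909978L, 909978L, 909978L, 3987387L, 3987387L),
  Date2 = as.Date(gsub(".", "-", fixed = TRUE,
    c("2012.06.10", "2013.12.06", "2014.08.30", "2011.08.26", "2015.04.06",
      "2011.05.14", "2014.09.14", "2004-09-12", "2005-05-10", "2015-01-02",
      "2014-10-06", "2014-08-26"))),
  Result_Nm = c(0.1, 2, 0.6, 3, 3, 0.003, 0.05, 0.2, 0.23, 2.4, 3.5, 1.17)
)

# For each row of dt1, pull the matching dt2 rows; Date1 is carried along
joined <- dt2[dt1, on = "StudentId"]

# Keep rows with Date2 <= Date1, drop Date1 again, and sort by Id
dt_result <- joined[Date2 <= Date1, .(Id, StudentId, Date2, Result_Nm)][order(Id)]
dt_result
```

Using on leaves dt2 unkeyed and unmodified, which may be preferable when the original row order of the table matters elsewhere.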
NOTE: The date columns were already converted to 'Date' class before the join.
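If lubridate is not available, the conversion mentioned in the note can be approximated in base R by trying each expected format in turn. The helper below, parse_mixed_date, is a hypothetical name introduced for this sketch; it assumes the only formats present are year-month-day with "-" or "." separators, as in the example data.

```r
# Hypothetical helper: parse dates written as "YYYY-MM-DD" or "YYYY.MM.DD"
parse_mixed_date <- function(x) {
  # First attempt: dash-separated values; anything else becomes NA
  d <- as.Date(x, format = "%Y-%m-%d")
  # Second attempt, only for the values that failed the first format
  failed <- is.na(d)
  d[failed] <- as.Date(x[failed], format = "%Y.%m.%d")
  d
}

parsed <- parse_mixed_date(c("2012.06.10", "2004-09-12", "not a date"))
parsed
```

Values that match neither format stay NA, so a quick anyNA() check after the conversion can flag unexpected input before the join.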
df1 <- structure(list(StudentId = c(23433L, 233989L, 909978L,
777492L,
3987387L), Date1 = c("2014-08-26", "2014-08-18", "2014-08-06",
"2014-08-11", "2014-08-26"), Lunch = c("Yes", "No", "No", "Yes",
"No")), .Names = c("StudentId", "Date1", "Lunch"),
class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(Id = 1:12, StudentId = c(777492L, 777492L,
777492L,
23433L, 23433L, 233989L, 233989L, 909978L, 909978L, 909978L,
3987387L, 3987387L), Date2 = c("2012.06.10", "2013.12.06",
"2014.08.30",
"2011.08.26", "2015.04.06", "2011.05.14", "2014.09.14", "2004-09-12",
"2005-05-10", "2015-01-02", "2014-10-06", "2014-08-26"),
Result_Nm = c(0.1,
2, 0.6, 3, 3, 0.003, 0.05, 0.2, 0.23, 2.4, 3.5, 1.17)),
.Names = c("Id",
"StudentId", "Date2", "Result_Nm"), class = "data.frame",
row.names = c(NA, -12L))