Subset based on date in a reference table


Problem description

I have table1 as follows:

   StudentId        Date1         Lunch
   23433            2014-08-26    Yes
   233989           2014-08-18    No
   909978           2014-08-06    No
   777492           2014-08-11    Yes
   3987387          2014-08-26    No

I have a second table, table2, as follows:

Id  StudentId        Date2        Result_Nm
1   777492           2012.06.10   0.1
2   777492           2013.12.06   2.0
3   777492           2014.08.30   0.6
4   23433            2011.08.26   3.0
5   23433            2015.04.06   3.0
6   233989           2011.05.14   0.003
7   233989           2014.09.14   0.05
8   909978           2004-09-12   0.2
9   909978           2005-05-10   0.23
10  909978           2015-01-02   2.4
11  3987387          2014-10-06   3.5
12  3987387          2014-08-26   1.17

I want to retain only the observations from table2 where the Date2 value is on or before the Date1 value for the same StudentId. In other words, the result should contain these rows:

  Id  StudentId        Date2         Result_Nm
  1   777492           2012.06.10    0.1
  2   777492           2013.12.06    2.0
  4   23433            2011.08.26    3.0
  6   233989           2011.05.14    0.003
  8   909978           2004-09-12    0.2
  9   909978           2005-05-10    0.23
  12  3987387          2014-08-26    1.17

Observation 3 is excluded because the Date1 value for StudentId 777492 is 2014-08-11, which is earlier than its Date2 value of 2014.08.30. Observations 5, 7, 10, and 11 are excluded for the same reason. I have used subset before, but this is a little more challenging and I need help.

Recommended answer

We can convert the date columns to 'Date' class with ymd from lubridate, which accepts multiple separators (., -). Then join the two datasets by 'StudentId' (left_join), drop the unwanted rows with filter, and select the specific columns.

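library(lubridate)
library(dplyr)

# Parse both date columns to Date class; ymd() accepts '.' and '-' separators
df2$Date2 <- ymd(df2$Date2)
df1$Date1 <- ymd(df1$Date1)

# Attach each student's Date1 to the table2 rows, keep the rows dated on or
# before it, then keep only the four original table2 columns
left_join(df2, df1, by = 'StudentId') %>%
  filter(Date2 <= Date1) %>%
  select(1:4)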
#    Id StudentId      Date2 Result_Nm
#1  1    777492 2012-06-10     0.100
#2  2    777492 2013-12-06     2.000
#3  4     23433 2011-08-26     3.000
#4  6    233989 2011-05-14     0.003
#5  8    909978 2004-09-12     0.200
#6  9    909978 2005-05-10     0.230
#7 12   3987387 2014-08-26     1.170
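For readers who prefer to avoid extra packages, the same lookup-then-filter idea can be sketched in base R. This is only an illustration, not part of the original answer: the names date1, date2, ref_date, keep, and result are made up here, and the sketch assumes each StudentId appears at most once in table1, since match() returns only the first match.

```r
# Sample data copied from the question (dates kept as character strings)
df1 <- data.frame(
  StudentId = c(23433L, 233989L, 909978L, 777492L, 3987387L),
  Date1 = c("2014-08-26", "2014-08-18", "2014-08-06", "2014-08-11", "2014-08-26"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Id = 1:12,
  StudentId = c(777492L, 777492L, 777492L, 23433L, 23433L, 233989L, 233989L,
                909978L, 909978L, 909978L, 3987387L, 3987387L),
  Date2 = c("2012.06.10", "2013.12.06", "2014.08.30", "2011.08.26", "2015.04.06",
            "2011.05.14", "2014.09.14", "2004-09-12", "2005-05-10", "2015-01-02",
            "2014-10-06", "2014-08-26"),
  Result_Nm = c(0.1, 2, 0.6, 3, 3, 0.003, 0.05, 0.2, 0.23, 2.4, 3.5, 1.17),
  stringsAsFactors = FALSE
)

# Normalise '.' separators to '-' so as.Date() can parse both formats
date1 <- as.Date(df1$Date1)
date2 <- as.Date(gsub(".", "-", df2$Date2, fixed = TRUE))

# Look up each table2 row's reference date by StudentId
# (assumes StudentId is unique in df1)
ref_date <- date1[match(df2$StudentId, df1$StudentId)]

# Keep rows that have a reference date and fall on or before it
keep <- !is.na(ref_date) & date2 <= ref_date
result <- df2[keep, ]
result$Id
# [1]  1  2  4  6  8  9 12
```

Using match() keeps table2's row order and column set unchanged, so no final column selection is needed.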

Or we can use data.table. Here we convert 'df2' from 'data.frame' to 'data.table' (setDT), set the key to 'StudentId' (setkey(..., StudentId)), join with a subset of 'df1' (the 'StudentId' and 'Date1' columns), and filter the output based on the condition (.SD[Date2 <= Date1]), grouped by the 'key' variable for each row of 'df1' (by = .EACHI). More information about .EACHI is in the data.table documentation.

library(data.table)
setkey(setDT(df2),StudentId)[df1[1:2], .SD[Date2<=Date1],by=.EACHI][order(Id)]
#   StudentId Id      Date2 Result_Nm
#1:    777492  1 2012-06-10     0.100
#2:    777492  2 2013-12-06     2.000
#3:     23433  4 2011-08-26     3.000
#4:    233989  6 2011-05-14     0.003
#5:    909978  8 2004-09-12     0.200
#6:    909978  9 2005-05-10     0.230
#7:   3987387 12 2014-08-26     1.170
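One detail of the call above is easy to misread: df1[1:2] relies on 'df1' still being a plain data.frame, where single-bracket indexing with one argument selects columns. A small base R sketch of that behaviour follows. The three-row 'df1' and the names cols and rows are made up for illustration. If 'df1' had already been converted to a data.table, the same expression would select the first two rows instead.

```r
# Hypothetical three-row version of table1, as a plain data.frame
df1 <- data.frame(
  StudentId = c(23433L, 233989L, 909978L),
  Date1 = as.Date(c("2014-08-26", "2014-08-18", "2014-08-06")),
  Lunch = c("Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# One index, no comma: selects the first two COLUMNS of a data.frame
cols <- df1[1:2]
names(cols)
# [1] "StudentId" "Date1"
nrow(cols)
# [1] 3

# With a comma, the same index selects the first two ROWS instead
rows <- df1[1:2, ]
nrow(rows)
# [1] 2
```

Selecting the columns by name, for example df1[c("StudentId", "Date1")], would state the intent more explicitly.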

NOTE: The date columns were already converted to 'Date' class before the join.
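The conversion matters because the comparison is only meaningful on real dates, and because the choice between <= and < decides whether a row dated exactly on the reference date (such as Id 12) is kept. A minimal base R sketch, using three Date2 values from the question. The names raw, parsed, and cutoff are illustrative only.

```r
# Three Date2 strings from the question, with mixed separators
raw <- c("2012.06.10", "2004-09-12", "2014-08-26")

# Replace '.' with '-' so the default as.Date() format applies to all of them
parsed <- as.Date(gsub(".", "-", raw, fixed = TRUE))
inherits(parsed, "Date")
# [1] TRUE

# Date1 for StudentId 3987387, which equals the last Date2 value above
cutoff <- as.Date("2014-08-26")

parsed <= cutoff   # inclusive: the tie is kept
# [1] TRUE TRUE TRUE
parsed < cutoff    # strict: the tie is dropped
# [1]  TRUE  TRUE FALSE
```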

df1 <-  structure(list(StudentId = c(23433L, 233989L, 909978L,
777492L, 
3987387L), Date1 = c("2014-08-26", "2014-08-18", "2014-08-06", 
"2014-08-11", "2014-08-26"), Lunch = c("Yes", "No", "No", "Yes", 
"No")), .Names = c("StudentId", "Date1", "Lunch"), 
class = "data.frame", row.names = c(NA, -5L))

df2 <-  structure(list(Id = 1:12, StudentId = c(777492L, 777492L, 
777492L, 
23433L, 23433L, 233989L, 233989L, 909978L, 909978L, 909978L, 
3987387L, 3987387L), Date2 = c("2012.06.10", "2013.12.06", 
"2014.08.30", 
"2011.08.26", "2015.04.06", "2011.05.14", "2014.09.14", "2004-09-12", 
"2005-05-10", "2015-01-02", "2014-10-06", "2014-08-26"), 
Result_Nm = c(0.1, 
2, 0.6, 3, 3, 0.003, 0.05, 0.2, 0.23, 2.4, 3.5, 1.17)),
.Names = c("Id", 
"StudentId", "Date2", "Result_Nm"), class = "data.frame", 
row.names = c(NA, -12L))
