Data Table merge based on date ranges
Question
I have two tables, policies and claims:
policies<-data.table(policyNumber=c(123,123,124,125),
EFDT=as.Date(c("2012-1-1","2013-1-1","2013-1-1","2013-2-1")),
EXDT=as.Date(c("2013-1-1","2014-1-1","2014-1-1","2014-2-1")))
> policies
policyNumber EFDT EXDT
1: 123 2012-01-01 2013-01-01
2: 123 2013-01-01 2014-01-01
3: 124 2013-01-01 2014-01-01
4: 125 2013-02-01 2014-02-01
claims<-data.table(claimNumber=c(1,2,3,4),
policyNumber=c(123,123,123,124),
lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31")),
claimAmount=c(10,20,20,15))
> claims
claimNumber policyNumber lossDate claimAmount
1: 1 123 2012-02-01 10
2: 2 123 2012-08-15 20
3: 3 123 2013-01-01 20
4: 4 124 2013-10-31 15
The policies table really contains policy-terms, since each row is uniquely identified by a policy number along with an effective date.
I want to merge the two tables in a way that associates claims with policy-terms. A claim is associated with a policy-term if it has the same policy number and the lossDate of the claim falls between the effective date and the expiration date of the policy-term (effective dates are inclusive bounds and expiration dates are exclusive bounds). How do I merge the tables in this way?
This should be similar to a left outer join. The result should look like:
policyNumber EFDT EXDT claimNumber lossDate claimAmount
1: 123 2012-01-01 2013-01-01 1 2012-02-01 10
2: 123 2012-01-01 2013-01-01 2 2012-08-15 20
3: 123 2013-01-01 2014-01-01 3 2013-01-01 20
4: 124 2013-01-01 2014-01-01 4 2013-10-31 15
5: 125 2013-02-01 2014-02-01 NA <NA> NA
Recommended answer
Version 1 (data.table v1.9.4+)
Try this:
# Policies table; I've added policyNumber 126:
policies<-data.table(policyNumber=c(123,123,124,125,126),
EFDT=as.Date(c("2012-01-01","2013-01-01","2013-01-01","2013-02-01","2013-02-01")),
EXDT=as.Date(c("2013-01-01","2014-01-01","2014-01-01","2014-02-01","2014-02-01")))
# Claims table; I've added two claims for 126 that are before and after the policy dates:
claims<-data.table(claimNumber=c(1,2,3,4,5,6),
policyNumber=c(123,123,123,124,126,126),
lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31","2012-06-01","2014-03-01")),
claimAmount=c(10,20,20,15,5,25))
# Set the keys for policies and claims so we can join them:
setkey(policies,policyNumber,EFDT)
setkey(claims,policyNumber,lossDate)
# Join the tables using roll
# ans<-policies[claims,list(EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),roll=T][,EFDT:=NULL] ## This worked with earlier versions of data.table, but broke when they updated the by-without-by behavior...
ans<-policies[claims,list(.EFDT=EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),by=.EACHI,roll=T][,`:=`(EFDT=.EFDT, .EFDT=NULL)]
# The claim should have inPolicy==TRUE where EFDT <= lossDate < EXDT
# (the effective date is inclusive, the expiration date is exclusive):
ans[lossDate>=EFDT & lossDate<EXDT, inPolicy:=TRUE]
# Set the keys again, but this time we'll join on both dates:
setkey(ans,policyNumber,EFDT,EXDT)
setkey(policies,policyNumber,EFDT,EXDT)
# Union the ans table with policies that don't have any claims:
ans<-rbindlist(list(ans, ans[policies][is.na(claimNumber)]))
ans
# policyNumber EFDT EXDT claimNumber lossDate claimAmount inPolicy
#1: 123 2012-01-01 2013-01-01 1 2012-02-01 10 TRUE
#2: 123 2012-01-01 2013-01-01 2 2012-08-15 20 TRUE
#3: 123 2013-01-01 2014-01-01 3 2013-01-01 20 TRUE
#4: 124 2013-01-01 2014-01-01 4 2013-10-31 15 TRUE
#5: 126 <NA> <NA> 5 2012-06-01 5 FALSE
#6: 126 2013-02-01 2014-02-01 6 2014-03-01 25 FALSE
#7: 125 2013-02-01 2014-02-01 NA <NA> NA NA
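The rolling join in Version 1 can be hard to follow in one line, so here is a minimal sketch of the idea on a smaller example. The table names terms and losses and the helper columns termStart and inTerm are made up for illustration; they are not part of the original answer. With roll=TRUE, each loss date is matched to the most recent effective date on or before it, and a separate check against EXDT then decides whether the loss is actually inside the term.

```r
library(data.table)

# Hypothetical mini-example: one policy with two consecutive terms.
terms <- data.table(
  policyNumber = c(123, 123),
  EFDT = as.Date(c("2012-01-01", "2013-01-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01"))
)
# Keep a copy of the effective date, because the join column EFDT
# takes the value of lossDate in the joined result.
terms[, termStart := EFDT]

# Four losses: before any term, inside term 1, exactly on the
# boundary, and after the last term has expired.
losses <- data.table(
  policyNumber = 123,
  lossDate = as.Date(c("2011-06-01", "2012-08-15", "2013-01-01", "2014-03-01"))
)

setkey(terms, policyNumber, EFDT)
setkey(losses, policyNumber, lossDate)

# Rolling join: each lossDate is matched to the latest EFDT <= lossDate.
rolled <- terms[losses, roll = TRUE]
setnames(rolled, "EFDT", "lossDate")

# A loss is in-term only if a term was found and it has not expired
# (expiration date is an exclusive bound).
rolled[, inTerm := !is.na(termStart) & lossDate < EXDT]
rolled
```

The roll only looks backwards for a starting date, which is why the explicit comparison against EXDT is still needed for losses that occur after a term has ended.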
Version 2
@Arun suggested using the new foverlaps function from data.table. My attempt below seems harder, not easier, so please let me know how to improve it.
## The foverlaps function requires both tables to have a start and end range, and the "y" table to be keyed
claims[, lossDate2:=lossDate] ## Add a redundant lossDate column to use as the end range for claims
setkey(policies, policyNumber, EFDT, EXDT) ## Set the key for policies ("y" table)
## Find the overlaps, remove the redundant lossDate2 column, and add the inPolicy column:
ans2 <- foverlaps(claims, policies, by.x=c("policyNumber", "lossDate", "lossDate2"))[, `:=`(inPolicy=T, lossDate2=NULL)]
## Update rows where the claim was out of policy:
ans2[is.na(EFDT), inPolicy:=F]
## Remove duplicates (such as policyNumber==123 & claimNumber==3),
## and add policies with no claims (policyNumber==125):
setkey(ans2, policyNumber, claimNumber, lossDate, EFDT) ## order the results
setkey(ans2, policyNumber, claimNumber) ## set the key to identify unique values
ans2 <- rbindlist(list(
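unique(ans2, by=key(ans2)), ## select only the unique values per key (since data.table v1.9.8, unique() no longer defaults to the key, so by= is needed)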
policies[!.(ans2[, unique(policyNumber)])] ## policies with no claims
), fill=T)
ans2
## policyNumber EFDT EXDT claimNumber lossDate claimAmount inPolicy
## 1: 123 2012-01-01 2013-01-01 1 2012-02-01 10 TRUE
## 2: 123 2012-01-01 2013-01-01 2 2012-08-15 20 TRUE
## 3: 123 2012-01-01 2013-01-01 3 2013-01-01 20 TRUE
## 4: 124 2013-01-01 2014-01-01 4 2013-10-31 15 TRUE
## 5: 126 <NA> <NA> 5 2012-06-01 5 FALSE
## 6: 126 <NA> <NA> 6 2014-03-01 25 FALSE
## 7: 125 2013-02-01 2014-02-01 NA <NA> NA NA
Version 3
Another version, using foverlaps():
require(data.table) ## 1.9.4+
setDT(claims)[, lossDate2 := lossDate]
setDT(policies)[, EXDTclosed := EXDT-1L]
setkey(claims, policyNumber, lossDate, lossDate2)
foverlaps(policies, claims, by.x=c("policyNumber", "EFDT", "EXDTclosed"))
foverlaps() requires both start and end ranges/intervals. Therefore, we duplicate the lossDate column into lossDate2.
Since EXDT needs to be an open (exclusive) bound, while foverlaps() treats both ends of an interval as inclusive, we subtract one day from it and place the result in a new column, EXDTclosed.
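To see why the one-day adjustment matters, the following sketch compares the two choices for a single claim that falls exactly on a term boundary. The variable names withOpenEnd and withClosedEnd are made up for this illustration, and nomatch=0L is used only to drop non-matching terms so the row counts are easy to compare.

```r
library(data.table)

# Two consecutive terms of one policy; 2013-01-01 is both the expiration
# date of the first term and the effective date of the second.
policies <- data.table(
  policyNumber = c(123, 123),
  EFDT = as.Date(c("2012-01-01", "2013-01-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01"))
)
policies[, EXDTclosed := EXDT - 1L]

# A single claim dated exactly on the boundary.
claims <- data.table(
  claimNumber = 3,
  policyNumber = 123,
  lossDate = as.Date("2013-01-01")
)
claims[, lossDate2 := lossDate]
setkey(claims, policyNumber, lossDate, lossDate2)

# Using EXDT as the interval end: both ends are inclusive,
# so the claim overlaps both terms.
withOpenEnd <- foverlaps(policies, claims,
                         by.x = c("policyNumber", "EFDT", "EXDT"),
                         nomatch = 0L)

# Using EXDTclosed: the first term now ends on 2012-12-31,
# so only the second term matches.
withClosedEnd <- foverlaps(policies, claims,
                           by.x = c("policyNumber", "EFDT", "EXDTclosed"),
                           nomatch = 0L)

nrow(withOpenEnd)
nrow(withClosedEnd)
```

This works for daily dates because subtracting one day turns the half-open interval [EFDT, EXDT) into the equivalent closed interval [EFDT, EXDT - 1].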
Now, we set the key. foverlaps() requires the last two key columns to be the interval columns, so they are specified last. We also want the overlap join to match on policyNumber first, so it is specified in the key as well.
We need to set the key on claims (check ?foverlaps). We don't have to set a key on policies, but you can if you wish (then you can skip the by.x argument, as it takes the key by default). Since we don't set the key for policies here, we specify the corresponding columns explicitly in the by.x argument. The overlap type is "any" by default, which we don't have to change (and therefore don't specify). This results in:
# policyNumber claimNumber lossDate claimAmount lossDate2 EFDT EXDT EXDTclosed
# 1: 123 1 2012-02-01 10 2012-02-01 2012-01-01 2013-01-01 2012-12-31
# 2: 123 2 2012-08-15 20 2012-08-15 2012-01-01 2013-01-01 2012-12-31
# 3: 123 3 2013-01-01 20 2013-01-01 2013-01-01 2014-01-01 2013-12-31
# 4: 124 4 2013-10-31 15 2013-10-31 2013-01-01 2014-01-01 2013-12-31
# 5: 125 NA <NA> NA <NA> 2013-02-01 2014-02-01 2014-01-31
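As mentioned above, by.x can be skipped when policies has a key. The sketch below shows that variant on the question's data and then tidies the result into the column layout asked for in the question. The result name res and the clean-up steps are additions for illustration, not part of the original answer.

```r
library(data.table)

policies <- data.table(
  policyNumber = c(123, 123, 124, 125),
  EFDT = as.Date(c("2012-01-01", "2013-01-01", "2013-01-01", "2013-02-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01", "2014-01-01", "2014-02-01"))
)
claims <- data.table(
  claimNumber = c(1, 2, 3, 4),
  policyNumber = c(123, 123, 123, 124),
  lossDate = as.Date(c("2012-02-01", "2012-08-15", "2013-01-01", "2013-10-31")),
  claimAmount = c(10, 20, 20, 15)
)

# Helper columns: a point interval for claims, a closed end for policies.
claims[, lossDate2 := lossDate]
policies[, EXDTclosed := EXDT - 1L]

# Key both tables, with the interval columns last.
setkey(claims, policyNumber, lossDate, lossDate2)
setkey(policies, policyNumber, EFDT, EXDTclosed)

# With policies keyed, by.x defaults to key(policies).
res <- foverlaps(policies, claims)

# Drop the helper columns and reorder to match the requested layout.
res[, c("lossDate2", "EXDTclosed") := NULL]
setcolorder(res, c("policyNumber", "EFDT", "EXDT",
                   "claimNumber", "lossDate", "claimAmount"))
res
```

Because policies is the x argument and nomatch is left at its default, policy-terms without any claim (policyNumber 125 here) are kept with NA in the claim columns, which gives the left-outer-join behaviour the question asks for.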