基于日期范围的数据表合并 [英] Data Table merge based on date ranges

查看:106
本文介绍了基于日期范围的数据表合并的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个表, policies claim

policies<-data.table(policyNumber=c(123,123,124,125), 
                EFDT=as.Date(c("2012-1-1","2013-1-1","2013-1-1","2013-2-1")), 
                EXDT=as.Date(c("2013-1-1","2014-1-1","2014-1-1","2014-2-1")))
> policies
   policyNumber       EFDT       EXDT
1:          123 2012-01-01 2013-01-01
2:          123 2013-01-01 2014-01-01
3:          124 2013-01-01 2014-01-01
4:          125 2013-02-01 2014-02-01


claims<-data.table(claimNumber=c(1,2,3,4), 
                   policyNumber=c(123,123,123,124),
                   lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31")),
                   claimAmount=c(10,20,20,15))
> claims
   claimNumber policyNumber   lossDate claimAmount
1:           1          123 2012-02-01          10
2:           2          123 2012-08-15          20
3:           3          123 2013-01-01          20
4:           4          124 2013-10-31          15

-terms,因为每一行由策略号以及生效日期唯一标识。

The policy table really contains policy-terms, since each row is uniquely identified by a policy number along with an effective date.

我想以一种方式合并两个表,术语。如果索赔具有相同的策略编号,并且索赔的lossDate属于策略术语的生效日期和到期日期(有效日期是包含边界,到期日期是排除边界),则该索赔与策略术语相关联。如何我以这种方式合并表?

I want to merge the two tables in a way that associates claims with policy-terms. A claim is associated with a policy term if it has the same policy number and the lossDate of the claim falls within the effective date and expiration date of the policy-term (effective dates are inclusive bounds and expiration dates are exclusive bounds.) How do I merge the tables in this way?

这应该类似于左外连接。结果应类似于

This should be similar to a left outer join. The result should look like

   policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount
1:          123 2012-01-01 2013-01-01           1 2012-02-01          10
2:          123 2012-01-01 2013-01-01           2 2012-08-15          20
3:          123 2013-01-01 2014-01-01           3 2013-01-01          20
4:          124 2013-01-01 2014-01-01           4 2013-10-31          15
5:          125 2013-02-01 2014-02-01          NA       <NA>          NA


推荐答案

版本1表格v1.9.4 +)

尝试:

# Policies table; I've added policyNumber 126:
policies<-data.table(policyNumber=c(123,123,124,125,126), 
                     EFDT=as.Date(c("2012-01-01","2013-01-01","2013-01-01","2013-02-01","2013-02-01")), 
                     EXDT=as.Date(c("2013-01-01","2014-01-01","2014-01-01","2014-02-01","2014-02-01")))

# Claims table; I've added two claims for 126 that are before and after the policy dates:
claims<-data.table(claimNumber=c(1,2,3,4,5,6), 
                   policyNumber=c(123,123,123,124,126,126),
                   lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31","2012-06-01","2014-03-01")),
                   claimAmount=c(10,20,20,15,5,25))

# Set the keys for policies and claims so we can join them:
setkey(policies,policyNumber,EFDT)
setkey(claims,policyNumber,lossDate)

# Join the tables using roll
# ans<-policies[claims,list(EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),roll=T][,EFDT:=NULL] ## This worked with earlier versions of data.table, but broke when they updated the by-without-by behavior...
ans<-policies[claims,list(.EFDT=EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),by=.EACHI,roll=T][,`:=`(EFDT=.EFDT, .EFDT=NULL)]

# The claim should have inPolicy==T where lossDate is between EFDT and EXDT:
ans[lossDate>=EFDT & lossDate<=EXDT, inPolicy:=T]

# Set the keys again, but this time we'll join on both dates:
setkey(ans,policyNumber,EFDT,EXDT)
setkey(policies,policyNumber,EFDT,EXDT)

# Union the ans table with policies that don't have any claims:
ans<-rbindlist(list(ans, ans[policies][is.na(claimNumber)]))

ans
#   policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount inPolicy
#1:          123 2012-01-01 2013-01-01           1 2012-02-01          10     TRUE
#2:          123 2012-01-01 2013-01-01           2 2012-08-15          20     TRUE
#3:          123 2013-01-01 2014-01-01           3 2013-01-01          20     TRUE
#4:          124 2013-01-01 2014-01-01           4 2013-10-31          15     TRUE
#5:          126       <NA>       <NA>           5 2012-06-01           5    FALSE
#6:          126 2013-02-01 2014-02-01           6 2014-03-01          25    FALSE
#7:          125 2013-02-01 2014-02-01          NA       <NA>          NA       NA

版本2

@Arun建议使用 data.table 中的新 foverlaps 函数。

@Arun suggested using the new foverlaps function from data.table. My attempt below seems harder, not easier, so please let me know how to improve it.

## The foverlaps function requires both tables to have a start and end range, and the "y" table to be keyed
claims[, lossDate2:=lossDate]  ## Add a redundant lossDate column to use as the end range for claims
setkey(policies, policyNumber, EFDT, EXDT) ## Set the key for policies ("y" table)

## Find the overlaps, remove the redundant lossDate2 column, and add the inPolicy column:
ans2 <- foverlaps(claims, policies, by.x=c("policyNumber", "lossDate", "lossDate2"))[, `:=`(inPolicy=T, lossDate2=NULL)]

## Update rows where the claim was out of policy:
ans2[is.na(EFDT), inPolicy:=F]

## Remove duplicates (such as policyNumber==123 & claimNumber==3),
##   and add policies with no claims (policyNumber==125):
setkey(ans2, policyNumber, claimNumber, lossDate, EFDT) ## order the results
setkey(ans2, policyNumber, claimNumber) ## set the key to identify unique values
ans2 <- rbindlist(list(
  unique(ans2), ## select only the unique values
  policies[!.(ans2[, unique(policyNumber)])] ## policies with no claims
), fill=T)

ans2
##    policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount inPolicy
## 1:          123 2012-01-01 2013-01-01           1 2012-02-01          10     TRUE
## 2:          123 2012-01-01 2013-01-01           2 2012-08-15          20     TRUE
## 3:          123 2012-01-01 2013-01-01           3 2013-01-01          20     TRUE
## 4:          124 2013-01-01 2014-01-01           4 2013-10-31          15     TRUE
## 5:          126       <NA>       <NA>           5 2012-06-01           5    FALSE
## 6:          126       <NA>       <NA>           6 2014-03-01          25    FALSE
## 7:          125 2013-02-01 2014-02-01          NA       <NA>          NA       NA






>


Version 3

使用 foverlaps(),另一个版本:

require(data.table) ## 1.9.4+
setDT(claims)[, lossDate2 := lossDate]
setDT(policies)[, EXDTclosed := EXDT-1L]
setkey(claims, policyNumber, lossDate, lossDate2)
foverlaps(policies, claims, by.x=c("policyNumber", "EFDT", "EXDTclosed"))

foverlaps() / em>和结束范围/间隔。因此,我们将 lossDate 列复制到 lossDate2

foverlaps() requires both start and end ranges/intervals. Therefore, we duplicate lossDate column on to lossDate2.

由于 EXDT 需要打开间隔,我们从中减去一个,并将其放在一个新列 EXDTclosed

Since EXDT needs to be open interval, we subtract one from it, and place it in a new column EXDTclosed.

现在,我们设置键。 foverlaps()需要最后两个键列为间隔。所以他们是最后指定的。我们还希望重叠联接,以使 policyNumber 匹配。

Now, we set the key. foverlaps() requires the last two key columns to be intervals. So they're specified last. And we also want overlapping join to first match by policyNumber. Hence, it's also specified in the key.

我们需要在 claim c>?foverlaps )。我们不必在 policies 上设置键。但你可以如果你愿意(那么你可以跳过 by.x 参数,因为它默认采用键值)。因为我们没有在 policies 中设置键,所以我们将明确指定 by.x 论据。默认情况下,重叠类型为 any ,我们不必更改(因此未指定)。这导致:

We need to set key on claims (check ?foverlaps). We don't have to set key on policies. But you can if you wish (then you can skip by.x argument as it by default takes the key value). Since we don't set the key for policies here, we'll specify explicitly the corresponding columns in by.x argument. The overlap type by default is any, which we don't have to change (and therefore not specified). This results in:

#    policyNumber claimNumber   lossDate claimAmount  lossDate2       EFDT       EXDT EXDTclosed
# 1:          123           1 2012-02-01          10 2012-02-01 2012-01-01 2013-01-01 2012-12-31
# 2:          123           2 2012-08-15          20 2012-08-15 2012-01-01 2013-01-01 2012-12-31
# 3:          123           3 2013-01-01          20 2013-01-01 2013-01-01 2014-01-01 2013-12-31
# 4:          124           4 2013-10-31          15 2013-10-31 2013-01-01 2014-01-01 2013-12-31
# 5:          125          NA       <NA>          NA       <NA> 2013-02-01 2014-02-01 2014-01-31

这篇关于基于日期范围的数据表合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆