查找与多个表的匹配项:使用data.table进行条件(完全)联接 [英] Find matches to several tables: conditional (full) join using data.table

查看:142
本文介绍了查找与多个表的匹配项:使用data.table进行条件(完全)联接的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这可能有一个简单的解决方案,但我似乎无法破解.

This might have an easy solution, but I can't seem to crack it.

例如,假设我有一张表,列出购买和客户详细信息:

For an example, suppose I have a table listing purchases and customer details:

library(data.table)
purchase <- setDT(structure(list(Name = c("John", "John", "Mary"), Surname = c("Smith", 
"Smith", "Jane"), PurchaseDate = c("2017-01-01", "2015-01-01", 
"2017-01-02")), .Names = c("Name", "Surname", "PurchaseDate"), row.names = c(NA, 
-3L), class = c("data.table", "data.frame")))

> purchase
   Name Surname PurchaseDate
1: John   Smith   2017-01-01
2: John   Smith   2015-01-01
3: Mary    Jane   2017-01-02

我想确定这些客户在购买时是否持有有效的折扣卡,该卡与两个数据库中保存的数据相匹配:

And I want to find if these customers held a valid discount card at a time of purchase, which matches data held in two databases:

df1 <- setDT(structure(list(Name = "John", Surname = "Smith", ValidFrom = "2016-12-31", 
    ValidTo = "2017-01-02"), .Names = c("Name", "Surname", "ValidFrom", 
"ValidTo"), row.names = c(NA, -1L), class = c("data.table", "data.frame")))

df2 <- setDT(structure(list(Name = "Mary", Surname = "Jane", ValidFrom = "2017-01-01", 
    ValidTo = "2017-01-03"), .Names = c("Name", "Surname", "ValidFrom", 
"ValidTo"), row.names = c(NA, -1L), class = c("data.table", "data.frame")))

> df1
   Name Surname  ValidFrom    ValidTo
1: John   Smith 2016-12-31 2017-01-02
> df2
   Name Surname  ValidFrom    ValidTo
1: Mary    Jane 2017-01-01 2017-01-03

我正在适应解决方案,该解决方案使用了data.table

I am adapting this solution, which uses data.table

library(data.table)
purchase[df1, on=c(Name='Name', Surname='Surname'), Match := 'Yes']
purchase[df2, on=c(Name='Name', Surname='Surname'), Match := 'Yes']

此操作的结果(基于左连接)将保存到原始purchase表中的Match变量中. (重要的是,这不需要创建新对象,而是将结果保存到原始对象,否则会变得混乱.)

The results of this (based on left join) are saved to the Match variable in the original purchase table. (Importantly, this does not require creating a new object, but saves the result to the original object, otherwise it would become messy.)

> purchase
   Name Surname PurchaseDate Match
1: John   Smith   2017-01-01   Yes
2: John   Smith   2015-01-01   Yes
3: Mary    Jane   2017-01-02   Yes

但是,我还需要检查PurchaseDate是否在ValidFromValidTo日期之内,并且对如何执行此操作不太了解.

But, I also need to check that the PurchaseDate is within ValidFrom and ValidTo dates, and don't have a good idea of how to do this.

为此,我可以将ValidFromValidTo日期加入联接中,然后使用ifelse确定购买是否在这些日期之间.

To do this, I can bring in the ValidFrom and ValidTo dates to the join and later determine whether the purchase was between those dates using ifelse.

purchase[df1, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]
purchase[df2, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]

太好了!这带来了日期:

Great! This brings the dates:

   Name Surname PurchaseDate Match      VFrom        VTo
1: John   Smith   2017-01-01   Yes 2016-12-31 2017-01-02
2: John   Smith   2015-01-01   Yes 2016-12-31 2017-01-02
3: Mary    Jane   2017-01-02   Yes 2017-01-01 2017-01-03

但是,如果客户有两张折扣卡,而购买仅落在其中一张折扣卡的有效期内,就会出现问题.假设玛丽有两张卡:

But, the problem arises if customer had two discount cards, and a purchase falls in the period of validity of only one of them. Suppose Mary had two cards:

df2 <- setDT(structure(list(Name = structure(c(1L, 1L), .Label = "Mary", class = "factor"), 
    Surname = structure(c(1L, 1L), .Label = "Jane", class = "factor"), 
    ValidFrom = structure(1:2, .Label = c("2017-01-01", "1945-01-01"
    ), class = "factor"), ValidTo = structure(1:2, .Label = c("2017-01-03", 
    "1946-01-01"), class = "factor")), .Names = c("Name", "Surname", 
"ValidFrom", "ValidTo"), row.names = c(NA, -2L), class = c("data.table", "data.frame")))

> df2
   Name Surname  ValidFrom    ValidTo
1: Mary    Jane 2017-01-01 2017-01-03
2: Mary    Jane 1945-01-01 1946-01-01

运行此

purchase[df2, on=c(Name='Name', Surname='Surname'), `:=`(Match='Yes', VFrom=ValidFrom, VTo=ValidTo)]

仅带来这些日期对中的一个(显然是最早的日期,与行号无关).

Brings only one of these pairs of dates (apparently the earlies one, regardless of the row number).

   Name Surname PurchaseDate Match      VFrom        VTo
1: John   Smith   2017-01-01   Yes 2016-12-31 2017-01-02
2: John   Smith   2015-01-01   Yes 2016-12-31 2017-01-02
3: Mary    Jane   2017-01-02   Yes 1945-01-01 1946-01-01

我如何引入所有匹配的行?

How would I bring in all matched rows?

据我了解,X[Y]语法支持追加到原始对象(我需要),还支持追加到我需要的:=函数,但不支持完全联接.另一种merge支持完全联接,但是需要在每个联接步骤中创建新对象(这将非常混乱),并且不支持:=.有任何想法吗?有没有办法以某种方式使用foverlaps?

From what I learned, X[Y] syntax supports appending to the original object (which I need), and also the := function, which I need, but does not support a full join. An alternative merge supports full joins, but requires creation of new object at every join-step (will be extremely messy), and does not support :=. Any ideas? Is there a way to use foverlaps somehow?

推荐答案

这里是一种解决方法:

# clean data
purchase[, PurchaseDate := as.IDate(PurchaseDate)]
df1[, `:=`(ValidFrom = as.IDate(ValidFrom), ValidTo = as.IDate(ValidTo))]
df2[, `:=`(ValidFrom = as.IDate(ValidFrom), ValidTo = as.IDate(ValidTo))]

# initialize
purchase[, matched := FALSE ]

# update joins
purchase[!(matched), matched := 
  df1[.SD, on=.(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate), 
    .N, by=.EACHI ]$N > 0L
]
purchase[!(matched), matched := 
  df2[.SD, on=.(Name, Surname, ValidFrom <= PurchaseDate, ValidTo >= PurchaseDate), 
    .N, by=.EACHI ]$N > 0L
]

我将df1df2分开,因为OP提到它们的连接规则在实际用例中有所不同.

I'm keeping df1 and df2 separate since the OP mentioned that their join rules differ in their actual use-case.

工作原理

总体结构是...

DT[, matched := FALSE ]
DT[!(matched), matched := expr1 ]
DT[!(matched), matched := expr2 ]

因此我们将matched初始化为false;然后在每个后续步骤中更新不匹配的行!(matched).

so we're initializing matched to false; and at each later step, updating unmatched rows, !(matched).

表达式以DT2[.SD, ...]开头,这只是对使用!(matched)过滤后的数据子集的联接.这样的联接根据on=过滤器在DT2中查找.SD的行.在这种情况下,on=过滤器与非等联接关联.***

The expressions start with DT2[.SD, ...], which is just a join against the Subset of Data we have after filtering with !(matched). Joins like this look up rows of .SD in DT2 according to the on= filters. In this case, the on= filters are associated with a non equi join.***

当使用by=.EACHI时,我们将按.SD的每一行进行分组.使用.N, by=.EACHI,我们获得与.SD的每一行匹配的DT2行的数量.

When we use by=.EACHI we are grouping by each row of .SD. With .N, by=.EACHI, we get the number of DT2 rows matched for each row of .SD.

一旦我们有了匹配的行数,就可以比较N > 0L来更新matched.

Once we have the number of matched rows, we can compare N > 0L to update matched.

***不幸的是,自2017年4月起有一个打开的错误在这种使用模式下,有时会出现关于.SD的错误.解决方法是将.SD替换为copy(.SD).

*** Unfortunately, as of April 2017 there's an open bug in this pattern of usage that sometimes gives an error about .SD. The workaround is to replace .SD with copy(.SD).

这篇关于查找与多个表的匹配项:使用data.table进行条件(完全)联接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆