使用 data.table 包通过引用进行条件二进制连接和更新 [英] Conditional binary join and update by reference using the data.table package
问题描述
所以这是我现实生活中的问题,我觉得它很容易解决,但我在这里遗漏了一些明显的东西.我有两个名为 TK
和 DFT
So here is my real life problem which I feel like can be easily solved and I'm missing something obvious here. I have two big data sets called TK
and DFT
library(data.table)
set.seed(123)
(TK <- data.table(venue_id = rep(1:3, each = 2),
DFT_id = rep(1:3, 2),
New_id = sample(1e4, 6),
key = "DFT_id"))
# venue_id DFT_id New_id
# 1: 1 1 2876
# 2: 1 2 7883
# 3: 2 3 4089
# 4: 2 1 8828
# 5: 3 2 9401
# 6: 3 3 456
(DFT <- data.table(venue_id = rep(1:2, each = 2),
DFT_id = 1:4,
New_id = sample(4),
key = "DFT_id"))
# venue_id DFT_id New_id
# 1: 1 1 3
# 2: 1 2 4
# 3: 2 3 2
# 4: 2 4 1
我想在 venue_id %in% 1:2<时对
DFT_id
列上的 TK
执行二进制左连接/code>,同时更新 New_id
通过引用.换句话说,期望的结果是
I want to perform a binary left join to TK
on the DFT_id
column when venue_id %in% 1:2
, while updating New_id
by reference. In other words, the desired result would be
TK
# venue_id DFT_id New_id
# 1: 1 1 3
# 2: 2 1 3
# 3: 1 2 4
# 4: 3 2 9401
# 5: 2 3 2
# 6: 3 3 456
我想结合这两个条件,但没有奏效(仍然不知道为什么)
I was thinking to combine both conditions, but it didn't work (still not sure why)
TK[venue_id %in% 1:2 & DFT, New_id := i.New_id][]
# Error in `[.data.table`(TK, DFT & venue_id %in% 1:2, `:=`(New_id, i.New_id)) :
# i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14).
# Please let datatable-help know if you'd like this, or add your comments to FR #1611.
我的下一个想法是使用链接,它通过正确连接来部分实现目标,但在一些临时表上而不实际影响TK
My next idea was to use chaining which partially achieves the goal by joining correctly but on some temporary table without actually affecting TK
TK[venue_id %in% 1:2][DFT, New_id := i.New_id][]
TK
# venue_id DFT_id New_id
# 1: 1 1 2876
# 2: 2 1 8828
# 3: 1 2 7883
# 4: 3 2 9401
# 5: 2 3 4089
# 6: 3 3 456
所以说清楚,我很清楚我可以将 TK
拆分成两个表,执行连接,然后再次执行 rbind
,但我做了很多像这样的不同条件连接,我也在寻找速度和内存高效的解决方案.
So to make clear, I'm well aware that I can split TK
into two tables, perform the join and then rbind
again, but I'm doing many different conditional joins like this and I'm also looking for both speed and memory efficient solutions.
这也意味着我不在寻找dplyr
解决方案,因为我试图同时使用二进制连接和通过引用更新 仅存在于 data.table
包 IIRC 中的功能.
This also means that I am not looking for a dplyr
solution as I'm trying to use both binary join and the update by reference features which only exist in the data.table
package IIRC.
有关其他信息,请参阅这些小插曲:
For additional information see these vignettes:
推荐答案
从 Arun 的更新答案中复制 此处
Copying from Arun's updated answer here
TK[venue_id %in% 1:2, New_id := DFT[.SD, New_id]][]
# venue_id DFT_id New_id
# 1: 1 1 3
# 2: 2 1 3
# 3: 1 2 4
# 4: 3 2 9401
# 5: 2 3 2
# 6: 3 3 456
他的回答详细说明了正在发生的事情.
His answer gives the details of what is going on.
这篇关于使用 data.table 包通过引用进行条件二进制连接和更新的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!