使用 data.table 包通过引用进行条件二进制连接和更新 [英] Conditional binary join and update by reference using the data.table package

查看:13
本文介绍了使用 data.table 包通过引用进行条件二进制连接和更新的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以这是我现实生活中的问题,我觉得它很容易解决,但我在这里遗漏了一些明显的东西.我有两个名为 TKDFT

So here is my real life problem which I feel like can be easily solved and I'm missing something obvious here. I have two big data sets called TK and DFT

library(data.table)
set.seed(123)
(TK <- data.table(venue_id = rep(1:3, each = 2), 
                  DFT_id = rep(1:3, 2), 
                  New_id = sample(1e4, 6),
                  key = "DFT_id"))

#    venue_id DFT_id New_id
# 1:        1      1   2876
# 2:        1      2   7883
# 3:        2      3   4089
# 4:        2      1   8828
# 5:        3      2   9401
# 6:        3      3    456

(DFT <- data.table(venue_id = rep(1:2, each = 2), 
                   DFT_id = 1:4, 
                   New_id = sample(4),
                   key = "DFT_id"))

#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        1      2      4
# 3:        2      3      2
# 4:        2      4      1

我想在 venue_id %in% 1:2<时对 DFT_id 列上的 TK 执行二进制左连接/code>,同时更新 New_id 通过引用.换句话说,期望的结果是

I want to perform a binary left join to TK on the DFT_id column when venue_id %in% 1:2, while updating New_id by reference. In other words, the desired result would be

TK
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

我想结合这两个条件,但没有奏效(仍然不知道为什么)

I was thinking to combine both conditions, but it didn't work (still not sure why)

TK[venue_id %in% 1:2 & DFT, New_id := i.New_id][]
# Error in `[.data.table`(TK, DFT & venue_id %in% 1:2, `:=`(New_id, i.New_id)) : 
#   i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). 
# Please let datatable-help know if you'd like this, or add your comments to FR #1611.

我的下一个想法是使用链接,它通过正确连接来部分实现目标,但在一些临时表上而不实际影响TK

My next idea was to use chaining which partially achieves the goal by joining correctly but on some temporary table without actually affecting TK

TK[venue_id %in% 1:2][DFT, New_id := i.New_id][]
TK
#    venue_id DFT_id New_id
# 1:        1      1   2876
# 2:        2      1   8828
# 3:        1      2   7883
# 4:        3      2   9401
# 5:        2      3   4089
# 6:        3      3    456

所以说清楚,我很清楚我可以将 TK 拆分成两个表,执行连接,然后再次执行 rbind,但我做了很多像这样的不同条件连接,我也在寻找速度和内存高效的解决方案.

So to make clear, I'm well aware that I can split TK into two tables, perform the join and then rbind again, but I'm doing many different conditional joins like this and I'm also looking for both speed and memory efficient solutions.

这也意味着我在寻找dplyr解决方案,因为我试图同时使用二进制连接通过引用更新 仅存在于 data.table 包 IIRC 中的功能.

This also means that I am not looking for a dplyr solution as I'm trying to use both binary join and the update by reference features which only exist in the data.table package IIRC.

有关其他信息,请参阅这些小插曲:

For additional information see these vignettes:

推荐答案

从 Arun 的更新答案中复制 此处

Copying from Arun's updated answer here

TK[venue_id %in% 1:2, New_id := DFT[.SD, New_id]][]
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

他的回答详细说明了正在发生的事情.

His answer gives the details of what is going on.

这篇关于使用 data.table 包通过引用进行条件二进制连接和更新的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆