条件二进制连接和更新通过引用使用data.table包 [英] Conditional binary join and update by reference using the data.table package

查看:88
本文介绍了条件二进制连接和更新通过引用使用data.table包的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以这里是我的现实生活问题,我觉得可以很容易解决,我错过了一些明显的在这里。我有两个大数据集 TK DFT

  library(data.table)
set.seed(123)
(TK< - data.table(venue_id = rep(1:3,each = 2 ),
DFT_id = rep(1:3,2),
New_id = sample(1e4,6),
key =DFT_id))

venue_id DFT_id New_id
#1:1 1 2876
#2:1 2 7883
#3:2 3 4089
#4:2 1 8828
#5: 3 2 9401
#6:3 3 456

(DFT < - data.table(venue_id = rep(1:2,each = 2),
DFT_id = 1 :4,
New_id = sample(4),
key =DFT_id))

#venue_id DFT_id New_id
#1:1 1 3
#2:1 2 4
#3:2 3 2
#4:2 4 1


b $ b

我要在 DFT_id 二进制左连接到 TK 更新 New_id 作为参考时, venue_id%in%1:2 。换句话说,所需的结果是

  TK 
#venue_id DFT_id New_id
#1:1 1 3
#2:2 1 3
#3:1 2 4
#4:3 2 9401
#5:2 3 2
#6:3 3 456

我想要合并这两个条件,但它没有工作)

  TK [venue_id%in%1:2&在[`.data.table`(TK,DFT& venue_id%in%1:2,`:=`(New_id,i.New_id))中出现错误'DFT,New_id:= i.New_id] ):
#i是无效类型(矩阵)。也许在将来,一个2列矩阵可以返回DT的元素列表(在FAQ 2.14中的A [B]的精神)。
#请让datatable帮助知道你是否愿意,或者添加你的​​意见到FR#1611。

我的下一个想法是使用链接,通过正确连接部分实现目标,但在一些临时表实际影响 TK

  TK [venue_id%in%1:2] [DFT,New_id:= i.New_id] [] 
TK
#venue_id DFT_id New_id
#1:1 1 2876
#2:2 1 8828
# 3:1 2 7883
#4:3 2 9401
#5:2 3 4089
#6:3 3 456

为了清楚起见,我很清楚我可以将 TK 分成两个表,执行连接和然后 rbind ,但我正在做许多不同的条件连接像这样,我也寻找速度和内存高效的解决方案。



这也意味着我不是寻找一个 dplyr m尝试使用二进制连接更新引用功能,只存在于 data.table 包IIRC 。






有关其他信息,请参阅这些小插曲:




  • 按参考更新

  • 二进制 //stackoverflow.com/questions/28889057/update-a-column-of-nas-in-one-data-table-with-the-value-from-a-column-in-another\">在这里

      TK [venue_id%在%1:2,New_id:= DFT [.SD,New_id]] [] 
    #venue_id DFT_id New_id
    #1:1 1 3
    #2:2 1 3
    #3:1 2 4
    #4:3 2 9401
    # 5:2 3 2
    #6:3 3 456

    正在发生什么。


    So here is my real life problem which I feel like can be easily solved and I'm missing something obvious here. I have two big data sets called TK and DFT

    library(data.table)
    set.seed(123)
    (TK <- data.table(venue_id = rep(1:3, each = 2), 
                      DFT_id = rep(1:3, 2), 
                      New_id = sample(1e4, 6),
                      key = "DFT_id"))
    
    #    venue_id DFT_id New_id
    # 1:        1      1   2876
    # 2:        1      2   7883
    # 3:        2      3   4089
    # 4:        2      1   8828
    # 5:        3      2   9401
    # 6:        3      3    456
    
    (DFT <- data.table(venue_id = rep(1:2, each = 2), 
                       DFT_id = 1:4, 
                       New_id = sample(4),
                       key = "DFT_id"))
    
    #    venue_id DFT_id New_id
    # 1:        1      1      3
    # 2:        1      2      4
    # 3:        2      3      2
    # 4:        2      4      1
    

    I want to perform a binary left join to TK on the DFT_id column when venue_id %in% 1:2, while updating New_id by reference. In other words, the desired result would be

    TK
    #    venue_id DFT_id New_id
    # 1:        1      1      3
    # 2:        2      1      3
    # 3:        1      2      4
    # 4:        3      2   9401
    # 5:        2      3      2
    # 6:        3      3    456
    

    I was thinking to combine both conditions, but it didn't work (still not sure why)

    TK[venue_id %in% 1:2 & DFT, New_id := i.New_id][]
    # Error in `[.data.table`(TK, DFT & venue_id %in% 1:2, `:=`(New_id, i.New_id)) : 
    #   i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). 
    # Please let datatable-help know if you'd like this, or add your comments to FR #1611.
    

    My next idea was to use chaining which partially achieves the goal by joining correctly but on some temporary table without actually affecting TK

    TK[venue_id %in% 1:2][DFT, New_id := i.New_id][]
    TK
    #    venue_id DFT_id New_id
    # 1:        1      1   2876
    # 2:        2      1   8828
    # 3:        1      2   7883
    # 4:        3      2   9401
    # 5:        2      3   4089
    # 6:        3      3    456
    

    So to make clear, I'm well aware that I can split TK into two tables, perform the join and then rbind again, but I'm doing many different conditional joins like this and I'm also looking for both speed and memory efficient solutions.

    This also means that I am not looking for a dplyr solution as I'm trying to use both binary join and the update by reference features which only exist in the data.table package IIRC.


    For additional information see these vignettes:

    解决方案

    Copying from Arun's updated answer here

    TK[venue_id %in% 1:2, New_id := DFT[.SD, New_id]][]
    #    venue_id DFT_id New_id
    # 1:        1      1      3
    # 2:        2      1      3
    # 3:        1      2      4
    # 4:        3      2   9401
    # 5:        2      3      2
    # 6:        3      3    456
    

    His answer gives the details of what is going on.

    这篇关于条件二进制连接和更新通过引用使用data.table包的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆