使用`j`选择`x`的联接列及其所有非联接列 [英] Use `j` to select the join column of `x` and all its non-join columns

查看:37
本文介绍了使用`j`选择`x`的联接列及其所有非联接列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据表:

 库(data.table)d1<-data.table(grp = c("a","c","b","a"),val = c(2,3,6,7),y1 = 1:4,y2 =5:8)d2<-data.table(grp = rep(c("a","b","c"),2),从= rep(c(1,5),每个= 3),到= rep(c(4,10),每个= 3),z = 11:16) 

我执行非等值联接,其中对于每个组"grp","d1"中的"val"值应落在"d2"中的"from"和"to"定义的范围内.

  d1 [d2,on =.(grp,val> =从,val< =到),不匹配= 0]#grp val y1 y2 val.1 z#1:a 1 1 5 4 11#2:c 1 2 6 4 13#3:a 5 4 8 10 14#4:b 5 3 7 10 15 

在输出中,联接变量来自 i ("val"和"val.1",其值分别在"d2"中为"from"和"to").但是,我想改用 x 的join列.现在,因为...

x 的列现在可以使用前缀 x引用.,并且在联接期间引用 x 的联接特别有用列,否则会被 i 遮盖.

...这可以通过在 j 中指定 val = x.val 来实现:

  d1 [d2,.(grp,val = x.val,z),on =.(grp,val> = from,val <= to),nomatch = 0] 

为了避免在 j 中的 x 中键入所有非联接列(可能很多),我目前的解决方法是将上述内容与原始数据联接在一起,可以得到理想的结果:

  d1 [d1 [d2,.(grp,val = x.val,z),on =.(grp,val> = from,val <= to),nomatch = 0],on =.(grp,val)]#grp val y1 y2 z#1:a 2 1 5 11#2:c 3 2 6 13#3:a 7 4 8 14#4:b 6 3 7 15 

但是,这似乎有点笨拙.因此,我的问题是:如何一次性从 x 中选择列,并从 j 中的 x 中选择所有非联接列?>


PS我已经考虑过切换 x i 数据集以及 on 中的条件.尽管这会产生所需的联接值,但仍需要后处理(列的删除,重命名和重新排序).

解决方案

PS我已经考虑过切换x和i数据集,以及打开条件.尽管这会产生所需的联接值,但仍需要后处理(删除,重命名和重新排序列).

后处理的数量受有多少 on = 个cols限制:

  d2 [d1,on =.(grp,从< = val,到> = val),nomatch = 0] [,`:=`(val = from,from = NULL,to = NULL)] [] 

这看起来还不错.


在@Jaap的评论之后,这是另一种方法,通过更新联接将列添加到 d1 :

  nm2 = setdiff(names(d2),c("from","to","grp"))d1 [d2,on =.(grp,val> = from,val <= to),(nm2):= mget(sprintf("i.%s",nm2))] 

这在这里很有意义,因为所需的输出本质上是 d1 加上 d2 中的一些列(因为 d1 的每一行最多匹配一个 d2 的行).

I have two data tables:

library(data.table)
d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 = 5:8)

d2 <- data.table(grp = rep(c("a", "b", "c"), 2),
                 from = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

I perform a non-equi join where the value 'val' in 'd1' should fall within the range defined by 'from' and 'to' in 'd2' for each group 'grp'.

d1[d2, on = .(grp, val >= from, val <= to), nomatch = 0]
#    grp val y1 y2 val.1  z
# 1:   a   1  1  5     4 11
# 2:   c   1  2  6     4 13
# 3:   a   5  4  8    10 14
# 4:   b   5  3  7    10 15

In the output, the join variables are from i ('val' and 'val.1', with the values of respectively 'from' and 'to' in 'd2'). However, I would like to have x's join column instead. Now, because...

Columns of x can now be referred to using the prefix x. and is particularly useful during joining to refer to x's join columns as they are otherwise masked by i's.

...this could be achieved by specifying val = x.val in j:

d1[d2, .(grp, val = x.val, z), on = .(grp, val >= from, val <= to), nomatch = 0]

In order to avoid typing all non-join columns (possibly many) from x in j, my current work-around is to join the above with the original data, which gives the desired result:

d1[d1[d2, .(grp, val = x.val, z), on = .(grp, val >= from, val <= to), nomatch = 0]
   , on = .(grp, val)]
#    grp val y1 y2  z
# 1:   a   2  1  5 11
# 2:   c   3  2  6 13
# 3:   a   7  4  8 14
# 4:   b   6  3  7 15

However, this seems a bit clumsy. Thus my question: how can I select the join column from x and all non-join columns from x in j in one go?


PS I have considered switching the x and i data sets, and the conditions in on. Although that produces the desired join values, it still requires post-processing (deleting, renaming and reordering of columns).

解决方案

PS I have considered switching the x and i data sets, and the conditions in on. Although that produces the desired join values, it still requires post-processing (deleting, renaming and reordering of columns).

The amount of post processing is limited by how many on= cols there are:

d2[d1, on=.(grp, from <= val, to >= val), nomatch=0][, 
  `:=`(val = from, from = NULL, to = NULL)][]

That doesn't seem too bad.


Following @Jaap's comment, here's another way, adding columns to d1 with an update join:

nm2 = setdiff(names(d2), c("from","to","grp"))
d1[d2, on=.(grp, val >= from, val <= to), (nm2) := mget(sprintf("i.%s", nm2))]

This makes sense here because the desired output is essentially d1 plus some columns from d2 (since each row of d1 matches at most one row of d2).

这篇关于使用`j`选择`x`的联接列及其所有非联接列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆