使用 `j` 选择 `x` 的连接列及其所有非连接列 [英] Use `j` to select the join column of `x` and all its non-join columns

查看:7
本文介绍了使用 `j` 选择 `x` 的连接列及其所有非连接列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据表:

库(data.table)d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 =5:8)d2 <- data.table(grp = rep(c("a", "b", "c"), 2),从 = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

我执行非 equi 连接,其中 'd1' 中的值 'val' 应落在 'd2' 中的 'from' 和 'to' 为每个组 'grp' 定义的范围内.

d1[d2, on = .(grp, val >= from, val <= to), nomatch = 0]# grp val y1 y2 val.1 z# 1: 1 1 5 4 11# 2:c 1 2 6 4 13# 3: 5 4 8 10 14# 4: b 5 3 7 10 15

在输出中,连接变量来自i('val'和'val.1','d2'中的值分别是'from'和'to').但是,我想改为使用 x 的连接列.现在,因为...

<块引用>

x 的列现在可以使用前缀 x. 来引用,并且在连接期间引用 x 的连接特别有用列,否则它们会被 i 所掩盖.

...这可以通过在 j 中指定 val = x.val 来实现:

d1[d2, .(grp, val = x.val, z), on = .(grp, val >= from, val <= to), nomatch = 0]

为了避免在 j 中从 x 键入所有非连接列(可能很多),我目前的解决方法是将上述与原始数据连接起来,这给出了预期的结果:

d1[d1[d2, .(grp, val = x.val, z), on = .(grp, val >= from, val <= to), nomatch = 0], on = .(grp, val)]#grp val y1 y2 z# 1: 2 1 5 11# 2: c 3 2 6 13# 3: 7 4 8 14# 4: b 6 3 7 15

但是,这似乎有点笨拙.因此我的问题是:如何一次从 x 中选择连接列和 j 中的 x 中的所有非连接列?p><小时>

PS我考虑过切换xi数据集,以及on中的条件.尽管这会产生所需的连接值,但它仍然需要后处理(删除、重命名和重新排序列).

解决方案

PS 我考虑过切换 x 和 i 数据集,以及开启条件.虽然这会产生所需的连接值,但它仍然需要后处理(删除、重命名和重新排序列).

后处理的数量受 on= 列数的限制:

d2[d1, on=.(grp, from <= val, to >= val), nomatch=0][,`:=`(val = from, from = NULL, to = NULL)][]

看起来还不错.

<小时>

根据@Jaap 的评论,这是另一种方法,使用更新连接将列添加到 d1:

nm2 = setdiff(names(d2), c("from","to","grp"))d1[d2, on=.(grp, val >= from, val <= to), (nm2) := mget(sprintf("i.%s", nm2))]

这是有道理的,因为所需的输出本质上是 d1 加上来自 d2 的一些列(因为 d1 的每一行最多匹配一个d2 的行).

I have two data tables:

library(data.table)
d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 = 5:8)

d2 <- data.table(grp = rep(c("a", "b", "c"), 2),
                 from = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

I perform a non-equi join where the value 'val' in 'd1' should fall within the range defined by 'from' and 'to' in 'd2' for each group 'grp'.

d1[d2, on = .(grp, val >= from, val <= to), nomatch = 0]
#    grp val y1 y2 val.1  z
# 1:   a   1  1  5     4 11
# 2:   c   1  2  6     4 13
# 3:   a   5  4  8    10 14
# 4:   b   5  3  7    10 15

In the output, the join variables are from i ('val' and 'val.1', with the values of respectively 'from' and 'to' in 'd2'). However, I would like to have x's join column instead. Now, because...

Columns of x can now be referred to using the prefix x. and is particularly useful during joining to refer to x's join columns as they are otherwise masked by i's.

...this could be achieved by specifying val = x.val in j:

d1[d2, .(grp, val = x.val, z), on = .(grp, val >= from, val <= to), nomatch = 0]

In order to avoid typing all non-join columns (possibly many) from x in j, my current work-around is to join the above with the original data, which gives the desired result:

d1[d1[d2, .(grp, val = x.val, z), on = .(grp, val >= from, val <= to), nomatch = 0]
   , on = .(grp, val)]
#    grp val y1 y2  z
# 1:   a   2  1  5 11
# 2:   c   3  2  6 13
# 3:   a   7  4  8 14
# 4:   b   6  3  7 15

However, this seems a bit clumsy. Thus my question: how can I select the join column from x and all non-join columns from x in j in one go?


PS I have considered switching the x and i data sets, and the conditions in on. Although that produces the desired join values, it still requires post-processing (deleting, renaming and reordering of columns).

解决方案

PS I have considered switching the x and i data sets, and the conditions in on. Although that produces the desired join values, it still requires post-processing (deleting, renaming and reordering of columns).

The amount of post processing is limited by how many on= cols there are:

d2[d1, on=.(grp, from <= val, to >= val), nomatch=0][, 
  `:=`(val = from, from = NULL, to = NULL)][]

That doesn't seem too bad.


Following @Jaap's comment, here's another way, adding columns to d1 with an update join:

nm2 = setdiff(names(d2), c("from","to","grp"))
d1[d2, on=.(grp, val >= from, val <= to), (nm2) := mget(sprintf("i.%s", nm2))]

This makes sense here because the desired output is essentially d1 plus some columns from d2 (since each row of d1 matches at most one row of d2).

这篇关于使用 `j` 选择 `x` 的连接列及其所有非连接列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆