是否可以使用 data.table index-join-assignment 惯用语进行左连接并在 i 到 x 的不匹配行中分配 NA? [英] Is it possible to use the data.table index-join-assignment idiom to do a left join and assign NAs in the non-matching rows of i to x?

查看:17
本文介绍了是否可以使用 data.table index-join-assignment 惯用语进行左连接并在 i 到 x 的不匹配行中分配 NA?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

昨天我给出了这个答案:按五列匹配数据表以更改另一列中的值.

Yesterday I gave this answer: Matching Data Tables by five columns to change a value in another column.

在评论中,OP 询问我们是否可以有效地实现两个表的左连接,从而获得将导致右表分配给左表的 NA.在我看来,data.table 没有提供任何方法来做到这一点.

In the comments, the OP asked if we could effectively achieve a left join of the two tables and thereby get the NAs that would result in the right table to be assigned to the left table. It seems to me that data.table does not provide any means of doing this.

这是我在该问题中使用的示例案例:

Here's the example case I used in that question:

set.seed(1L);
dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
dt1;
##     id V1 V2      blah1
##  1:  1  1  1 -0.6264538
##  2:  2  2  1  0.1836433
##  3:  3  3  1 -0.8356286
##  4:  4  1  2  1.5952808
##  5:  5  2  2  0.3295078
##  6:  6  3  2 -0.8204684
##  7:  7  1  3  0.4874291
##  8:  8  2  3  0.7383247
##  9:  9  3  3  0.5757814
## 10: 10  1  4 -0.3053884
## 11: 11  2  4  1.5117812
## 12: 12  3  4  0.3898432
dt2;
##    id V1 V2       blah2
## 1: 13  1  1 -0.62124058
## 2: 14  2  1 -2.21469989
## 3: 15  1  2  1.12493092
## 4: 16  2  2 -0.04493361
## 5: 17  1  3 -0.01619026
## 6: 18  2  3  0.94383621
key <- paste0('V',1:2);

这是我提供的解决方案,它不会为不匹配的行获取 NA:

And here's the solution I gave which does not get NAs for non-matching rows:

dt1[dt2,on=key,id:=i.id];
dt1;
##     id V1 V2      blah1
##  1: 13  1  1 -0.6264538
##  2: 14  2  1  0.1836433
##  3:  3  3  1 -0.8356286
##  4: 15  1  2  1.5952808
##  5: 16  2  2  0.3295078
##  6:  6  3  2 -0.8204684
##  7: 17  1  3  0.4874291
##  8: 18  2  3  0.7383247
##  9:  9  3  3  0.5757814
## 10: 10  1  4 -0.3053884
## 11: 11  2  4  1.5117812
## 12: 12  3  4  0.3898432

我们需要的是将 id 值 12 及以下留在 dt1 中的值替换为 NA(不是,因为它们是 12和下,不是,因为dt2中缺少这些id值,而是因为key列上的连接,即V1V2,不会导致 dt1 中的那些行与 dt2 匹配.

What we need is for the id values 12 and under that remain in dt1 to be replaced with NAs (not because they are 12 and under, and not because those id values are missing from dt2, but because the join on the key columns, namely V1 and V2, does not result in a match for those rows in dt1 against dt2).

正如我在该问题的评论中所说,解决方法是将 dt1$id 预分配给所有 NA,然后运行索引连接分配.因此,这是预期的输出:

As I said in the comments of that question, a workaround is to preassign dt1$id to all NAs, and then run the index-join-assignment. Hence, this is the expected output:

dt1$id <- NA;
dt1[dt2,on=key,id:=i.id];
dt1;
##     id V1 V2      blah1
##  1: 13  1  1 -0.6264538
##  2: 14  2  1  0.1836433
##  3: NA  3  1 -0.8356286
##  4: 15  1  2  1.5952808
##  5: 16  2  2  0.3295078
##  6: NA  3  2 -0.8204684
##  7: 17  1  3  0.4874291
##  8: 18  2  3  0.7383247
##  9: NA  3  3  0.5757814
## 10: NA  1  4 -0.3053884
## 11: NA  2  4  1.5117812
## 12: NA  3  4  0.3898432

我认为解决方法没问题,但我不确定为什么 data.table 似乎无法通过索引连接分配操作一次性完成此功能.以下是我探索过的三个死胡同:

I think the workaround is ok, but I'm not sure why data.table doesn't seem to be capable of this functionality in one shot with an index-join-assign operation. The following are three dead-ends I explored:

1: nomatch

data.table 提供了一个 nomatch 参数,它看起来有点像 allall.xall.merge() 的 y 参数.这实际上是一个非常有限的论点;它只允许从右连接(nomatch=NA,默认值)更改为内连接(nomatch=0).我们无法实现与它的左连接.

data.table provides a nomatch argument which looks a little bit like the all, all.x, and all.y arguments of merge(). This is actually a very limited argument; it only allows changing from a right join (nomatch=NA, the default) to an inner join (nomatch=0). We cannot achieve a left join with it.

2:翻转dt1dt2

由于dt1[dt2]是一个右连接,我们可以直接翻转它,即dt2[dt1],实现对应的左连接.

Since dt1[dt2] is a right join, we can just flip it, meaning dt2[dt1], to achieve the corresponding left join.

这也行不通,因为我们需要在 j 参数中使用 := 就地赋值语法来赋值给 dt1,在翻转调用下,我们将改为分配给 dt2.我尝试在翻转命令下分配给i.id,但它并没有影响原来的dt1.

This will not work either because we need to use the := in-place assignment syntax in the j argument to assign into dt1, and under the flipped call, we'll instead be assigning to dt2. I tried assigning to i.id under the flipped command, but it didn't affect the original dt1.

3:使用merge.data.table()

我们可以使用 all.x=T 参数调用 merge.data.table() 来实现左连接.现在的问题是 merge.data.table() 没有 j 参数,它根本没有提供就地分配左侧(或右侧)列的方法) 表.

We can call merge.data.table() with the all.x=T argument to achieve a left join. The problem now is that merge.data.table() has no j argument, and it simply provides no means of assigning in-place a column of the left (or right) table.

那么,有没有可能使用data.table来执行这个操作?如果是这样,最好的方法是什么?

So, is it possible to perform this operation at all using data.table? And if so, what's the best way to do it?

推荐答案

AFAIU you just want to look up id column from dt2 to dt1.dt1 中的原始 id 变量似乎与整个过程无关,因为您加入了 V1,V2 并且您不想拥有 dt1$id 结果中的值.因此,从技术上讲,解决它的正确方法是根本不使用该列.

AFAIU you just want to lookup id column from dt2 to dt1. Original id variable in dt1 seems to be unrelated to whole process as you join on V1,V2 and you don't want to have dt1$id values in the result. So technically correct way to address it is to not use that column at all.

set.seed(1)
library(data.table)
dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
on = paste0("V",1:2) # I rename to `on` to not mask `key` function
dt1[,id:=NULL
    ][dt2,on=on,id:=i.id
      ][]
#  V1 V2      blah1 id
# 1:  1  1 -0.6264538 13
# 2:  2  1  0.1836433 14
# 3:  3  1 -0.8356286 NA
# 4:  1  2  1.5952808 15
# 5:  2  2  0.3295078 16
# 6:  3  2 -0.8204684 NA
# 7:  1  3  0.4874291 17
# 8:  2  3  0.7383247 18
# 9:  3  3  0.5757814 NA
#10:  1  4 -0.3053884 NA
#11:  2  4  1.5117812 NA
#12:  3  4  0.3898432 NA

除了问题...
- 如果只有单个表达式要计算,则不必在行尾使用 ;
- 使用 dt1[, id := NA_integer_] 而不是 dt1$id <- NA
- 在提供带有 rnorm 和其他随机相关调用的代码时使用 set.seed

Aside from the question...
- you don't have to use ; at the end of line if there is only single expression to evaluate
- use dt1[, id := NA_integer_] instead of dt1$id <- NA
- use set.seed when providing code with rnorm and other randomness related calls

这篇关于是否可以使用 data.table index-join-assignment 惯用语进行左连接并在 i 到 x 的不匹配行中分配 NA?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆