是否可以使用data.table index-join-assignment idiom做一个左连接,并在i到x的非匹配行中分配NA? [英] Is it possible to use the data.table index-join-assignment idiom to do a left join and assign NAs in the non-matching rows of i to x?

查看:220
本文介绍了是否可以使用data.table index-join-assignment idiom做一个左连接,并在i到x的非匹配行中分配NA?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

昨天我给了这个答案:将数据表匹配五列,以更改另一列中的值



在评论中,OP询问我们是否可以有效地实现两个表的左连接,从而获得将导致右表被分配给左表的NA。在我看来,data.table没有提供任何方法这样做。



这里是我在这个问题中使用的例子:

  set.seed(1L); 
dt1< - data.table(id = 1:12,expand.grid(V1 = 1:3,V2 = 1:4),blah1 = rnorm(12L));
dt2< - data.table(id = 13:18,expand.grid(V1 = 1:2,V2 = 1:3),blah2 = rnorm(6L));
dt1;
## id V1 V2 blah1
## 1:1 1 1 -0.6264538
## 2:2 2 1 0.1836433
## 3:3 3 1 -0.8356286
## 4:4 1 2 1.5952808
## 5:5 2 2 0.3295078
## 6:6 3 2 -0.8204684
## 7:7 1 3 0.4874291
## 8:8 2 3 0.7383247
## 9:9 3 3 0.5757814
## 10:10 1 4 -0.3053884
## 11:11 2 4 1.5117812
## 12:12 3 4 0.3898432
dt2;
## id V1 V2 blah2
## 1:13 1 1 -0.62124058
## 2:14 2 1 -2.21469989
## 3:15 1 2 1.12493092
## 4:16 2 2 -0.04493361
## 5:17 1 3 -0.01619026
## 6:18 2 3 0.94383621
key< - paste0('V' 1:2);

这里是我给出的解决方案匹配行:

  dt1 [dt2,on = key,id:= i.id]; 
dt1;
## id V1 V2 blah1
## 1:13 1 1 -0.6264538
## 2:14 2 1 0.1836433
## 3:3 3 1 -0.8356286
## 4:15 1 2 1.5952808
## 5:16 2 2 0.3295078
## 6:6 3 2 -0.8204684
## 7:17 1 3 0.4874291
## 8:18 2 3 0.7383247
## 9:9 3 3 0.5757814
## 10:10 1 4 -0.3053884
## 11:11 2 4 1.5117812
## 12:12 3 4 0.3898432

我们需要的是 id 值将保留在 dt1 中以替换为NAs(不是,因为它们是12和以下, ,因为 dt2 中缺少这些id值,但因为列,即 V1 V2 ,不会导致 dt1中的行匹配 c> )。



正如我在该问题的意见中所说,解决方法是将 dt1 $ id 预分配给所有NAs,然后运行index-join-assignment。因此,这是预期的输出:

  dt1 $ id < 
dt1 [dt2,on = key,id:= i.id];
dt1;
## id V1 V2 blah1
## 1:13 1 1 -0.6264538
## 2:14 2 1 0.1836433
## 3:NA 3 1 -0.8356286
## 4:15 1 2 1.5952808
## 5:16 2 2 0.3295078
## 6:NA 3 2 -0.8204684
## 7:17 1 3 0.4874291
## 8:18 2 3 0.7383247
## 9:NA 3 3 0.5757814
## 10:NA 1 4 -0.3053884
## 11:NA 2 4 1.5117812
## 12:NA 3 4 0.3898432

我认为解决方法是确定的,但我不确定为什么data.table似乎不能通过索引联接分配操作在一次性中具有此功能。以下是我探讨的三个死胡同:



1: nomatch p>

data.table提供了一个 nomatch 参数,看起来有点像 all all.x all.y 。这实际上是一个非常有限的论点;它只允许从正确的连接( nomatch = NA ,默认)改为内部连接( nomatch = 0 )。

2:翻转 dt1 dt2



由于 dt1 [dt2] ,我们可以翻转它,意思是 dt2 [dt1] ,以实现相应的左连接。



不工作,因为我们需要使用 j 参数中的:= 就地赋值语法将 dt1 ,并且在翻转调用下,我们将分配给 dt2 。我尝试在flipped命令下分配 i.id ,但它不影响原始的 dt1

3:使用 merge.data.table()



我们可以使用 all.x = T merge.data.table() >参数来实现左连接。现在的问题是 merge.data.table()没有 j 参数,它只是没有提供方法分配左(或右)表的列。






因此,可以执行此操作在所有使用data.table?如果是,最好的方法是什么?

解决方案

AFAIU你只想查找 id从 dt2 dt1 当你加入 V1时,中的原始变量似乎与整个过程无关, V2 ,并且您不想在结果中具有 dt1 $ id 值。所以技术上正确的解决方法是不使用该列。



  set.seed (data.table)
dt1 < - data.table(id = 1:12,expand.grid(V1 = 1:3,V2 = 1:4),blah1 = rnorm(12L));
dt2< - data.table(id = 13:18,expand.grid(V1 = 1:2,V2 = 1:3),blah2 = rnorm(6L));
on = paste0(V,1:2)#我重命名为`on'以不掩盖`key`函数
dt1 [,id:= NULL
] [dt2,on = on,id:= i.id
] []
#V1 V2 blah1 id
#1:1 1 -0.6264538 13
#2:2 1 0.1836433 14
#3:3 1 -0.8356286 NA
#4:1 2 1.5952808 15
#5:2 2 0.3295078 16
#6:3 2 -0.8204684 NA
#7 :1 3 0.4874291 17
#8:2 3 0.7383247 18
#9:3 3 0.5757814 NA
#10:1 4 -0.3053884 NA
#11:2 4 1.5117812 NA
#12:3 4 0.3898432 NA

除了问题...

- 如果只有单个表达式来求值,则不必在行末使用;

- 使用 dt1 [,id:= NA_integer _] 而不是 dt1 $ id< - NA

- use set.seed 在提供代码 rnorm 和其他随机相关调用


Yesterday I gave this answer: Matching Data Tables by five columns to change a value in another column.

In the comments, the OP asked if we could effectively achieve a left join of the two tables and thereby get the NAs that would result in the right table to be assigned to the left table. It seems to me that data.table does not provide any means of doing this.

Here's the example case I used in that question:

set.seed(1L);
dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
dt1;
##     id V1 V2      blah1
##  1:  1  1  1 -0.6264538
##  2:  2  2  1  0.1836433
##  3:  3  3  1 -0.8356286
##  4:  4  1  2  1.5952808
##  5:  5  2  2  0.3295078
##  6:  6  3  2 -0.8204684
##  7:  7  1  3  0.4874291
##  8:  8  2  3  0.7383247
##  9:  9  3  3  0.5757814
## 10: 10  1  4 -0.3053884
## 11: 11  2  4  1.5117812
## 12: 12  3  4  0.3898432
dt2;
##    id V1 V2       blah2
## 1: 13  1  1 -0.62124058
## 2: 14  2  1 -2.21469989
## 3: 15  1  2  1.12493092
## 4: 16  2  2 -0.04493361
## 5: 17  1  3 -0.01619026
## 6: 18  2  3  0.94383621
key <- paste0('V',1:2);

And here's the solution I gave which does not get NAs for non-matching rows:

dt1[dt2,on=key,id:=i.id];
dt1;
##     id V1 V2      blah1
##  1: 13  1  1 -0.6264538
##  2: 14  2  1  0.1836433
##  3:  3  3  1 -0.8356286
##  4: 15  1  2  1.5952808
##  5: 16  2  2  0.3295078
##  6:  6  3  2 -0.8204684
##  7: 17  1  3  0.4874291
##  8: 18  2  3  0.7383247
##  9:  9  3  3  0.5757814
## 10: 10  1  4 -0.3053884
## 11: 11  2  4  1.5117812
## 12: 12  3  4  0.3898432

What we need is for the id values 12 and under that remain in dt1 to be replaced with NAs (not because they are 12 and under, and not because those id values are missing from dt2, but because the join on the key columns, namely V1 and V2, does not result in a match for those rows in dt1 against dt2).

As I said in the comments of that question, a workaround is to preassign dt1$id to all NAs, and then run the index-join-assignment. Hence, this is the expected output:

dt1$id <- NA;
dt1[dt2,on=key,id:=i.id];
dt1;
##     id V1 V2      blah1
##  1: 13  1  1 -0.6264538
##  2: 14  2  1  0.1836433
##  3: NA  3  1 -0.8356286
##  4: 15  1  2  1.5952808
##  5: 16  2  2  0.3295078
##  6: NA  3  2 -0.8204684
##  7: 17  1  3  0.4874291
##  8: 18  2  3  0.7383247
##  9: NA  3  3  0.5757814
## 10: NA  1  4 -0.3053884
## 11: NA  2  4  1.5117812
## 12: NA  3  4  0.3898432

I think the workaround is ok, but I'm not sure why data.table doesn't seem to be capable of this functionality in one shot with an index-join-assign operation. The following are three dead-ends I explored:

1: nomatch

data.table provides a nomatch argument which looks a little bit like the all, all.x, and all.y arguments of merge(). This is actually a very limited argument; it only allows changing from a right join (nomatch=NA, the default) to an inner join (nomatch=0). We cannot achieve a left join with it.

2: flip dt1 and dt2

Since dt1[dt2] is a right join, we can just flip it, meaning dt2[dt1], to achieve the corresponding left join.

This will not work either because we need to use the := in-place assignment syntax in the j argument to assign into dt1, and under the flipped call, we'll instead be assigning to dt2. I tried assigning to i.id under the flipped command, but it didn't affect the original dt1.

3: use merge.data.table()

We can call merge.data.table() with the all.x=T argument to achieve a left join. The problem now is that merge.data.table() has no j argument, and it simply provides no means of assigning in-place a column of the left (or right) table.


So, is it possible to perform this operation at all using data.table? And if so, what's the best way to do it?

解决方案

AFAIU you just want to lookup id column from dt2 to dt1. Original id variable in dt1 seems to be unrelated to whole process as you join on V1,V2 and you don't want to have dt1$id values in the result. So technically correct way to address it is to not use that column at all.

set.seed(1)
library(data.table)
dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
on = paste0("V",1:2) # I rename to `on` to not mask `key` function
dt1[,id:=NULL
    ][dt2,on=on,id:=i.id
      ][]
#  V1 V2      blah1 id
# 1:  1  1 -0.6264538 13
# 2:  2  1  0.1836433 14
# 3:  3  1 -0.8356286 NA
# 4:  1  2  1.5952808 15
# 5:  2  2  0.3295078 16
# 6:  3  2 -0.8204684 NA
# 7:  1  3  0.4874291 17
# 8:  2  3  0.7383247 18
# 9:  3  3  0.5757814 NA
#10:  1  4 -0.3053884 NA
#11:  2  4  1.5117812 NA
#12:  3  4  0.3898432 NA

Aside from the question...
- you don't have to use ; at the end of line if there is only single expression to evaluate
- use dt1[, id := NA_integer_] instead of dt1$id <- NA
- use set.seed when providing code with rnorm and other randomness related calls

这篇关于是否可以使用data.table index-join-assignment idiom做一个左连接,并在i到x的非匹配行中分配NA?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆