匹配两个data.tables的行以填充data.table的子集 [英] match rows of two data.tables to fill subset of a data.table

查看:69
本文介绍了匹配两个data.tables的行以填充data.table的子集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据表:

> DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c")
                  , col2 = c("b","d","c","a","d","a","c","a")
                  , col3 = c(1,2,3,4,5,6,7,8))
> DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a")
                  , col2 = c("d","b","c","d","a","a","c","a")
                  , col3 = c(NA,1,2,NA,6,NA,3,NA))

> DT1
   col1 col2 col3
1:    a    b    1
2:    b    d    2
3:    b    c    3
4:    a    a    4
5:    c    d    5
6:    b    a    6
7:    a    c    7
8:    c    a    8

> DT2
   col1 col2 col3
1:    b    d   NA
2:    e    b    1
3:    c    c    2
4:    e    d   NA
5:    b    a    6
6:    c    a   NA
7:    d    c    3
8:    a    a   NA

我想使用col1和col2将col3为NA的DT2行与DT1的行进行匹配,如果存在匹配,用DT1中的行填充DT2中col3的NA值.

I'd like to match the rows of DT2 for which col3 is NA to those of DT1, using col1 and col2 and fill the NA values of col3 in DT2 with those in DT1, if match exists.

> #desired Output
> DT2_output
   col1 col2 col3
1:    b    d    2
2:    e    b    1
3:    c    c    2
4:    e    d   NA
5:    b    a    6
6:    c    a    8
7:    d    c    3
8:    a    a    4

我如何使用简洁的data.table操作(无循环)来执行此操作,因为每个data.table中有数百万行. 我尝试了以下操作,但它给了我错误,我认为这与what语句有关.

how do I do this using succinct data.table operation (no loops), as there are millions of rows in each data.table. I tried the following and it gives me error, which I believe has something to do with the which statements.

> ##doesn't work
> DT2[is.na(col3), col3 := DT1[which(col1 == DT2[is.na(col3),col1] && col2 == DT2[is.na(col3), col2]), col3]]

推荐答案

我可以直接进行左连接,然后基于是否缺少原始col3进行ifelse条件,如下所示:

I could do directly a left join and then ifelse condition based on if the original col3 is missing like follows:

DT2new <- merge(DT2, DT1, by = c("col1", "col2"), all.x = T)
DT2new[, col3 := ifelse(is.na(col3.x), col3.y, col3.x)]
DT2new <- DT2new[, .(col1, col2, col3)]

#   col1 col2 col3
#1:    a    a    4
#2:    b    a    6
#3:    b    d    2
#4:    c    a    8
#5:    c    c    2
#6:    d    c    3
#7:    e    b    1
#8:    e    d   NA

或者,一种更有效的方法是通过引用进行操作,该操作直接修改(通过引用代替)DT2数据.表:

Alternatively, a more efficient way is to do the operation by reference, which directly modifies (in place by reference) the DT2 data.table :

DT2[DT1, on = .(col1, col2), col3 := i.col3]

#   col1 col2 col3
#1:    b    d    2
#2:    e    b    1
#3:    c    c    2
#4:    e    d   NA
#5:    b    a    6
#6:    c    a    8
#7:    d    c    3
#8:    a    a    4

i.col3中的i引用了DT[i, j, by]中的i,因此引用了DT1中的列col3.之所以可行,是因为在任何情况下,两个data.tables中的col1和col2列都不会丢失.如果遇到这种情况,您可以执行以下更通用的操作(包括data.tables DT1和DT2的示例):

The i in i.col3 refers to the i as in DT[i, j, by], hence to the column col3 in DT1. This works because you do not have any cases when a non missing match for columns col1 and col2 in both data.tables would occur. If you have such cases you could do a more generic operation as follows (including examples of data.tables DT1 and DT2):

DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c", "e"), 
                  col2 = c("b","d","c","a","d","a","c","a", "b"),
                  col3 = c(1,2,3,4,5,6,7,8, 22))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
                  col2 = c("d","b","c","d","a","a","c","a"),
                  col3 = c(NA,1,2,NA,6,NA,3,NA))

您可以看到DT1col1 = "e"col2 = "b"值为22.DT2col1 = "e"col2 = "b"值为1.为了优先使用如果发生此类冲突,您可以这样做:

You can see that DT1 has for col1 = "e" and col2 = "b" a value of 22. The DT2 has for col1 = "e" and col2 = "b" a value of 1. In order to give preference to DT2 if such clash occurs you would do:

DT2[DT1, on = .(col1, col2), col3 := ifelse(is.na(x.col3), i.col3, x.col3)]

哪个给你

#   col1 col2 col3
#1:    b    d    2
#2:    e    b    1
#3:    c    c    2
#4:    e    d   NA
#5:    b    a    6
#6:    c    a    8
#7:    d    c    3
#8:    a    a    4

x.col3中的x.指的是DT2中的col3.

矢量化data.table操作方法是一种更有效的通用方法(如果DT1自己包含一对col1和col2对的值)

A more efficient generic method is the vectorised data.table operation approach (in case that DT1 contains values for a certain pair of col1 and col2 on its own)

鉴于有用的评论(@Frank& @ chinsoon12),我再次检查了提供的解决方案.正如指出的那样,ifelse可能会变慢(注释中的原因),这就是为什么向量化解决方案是更好的选择的原因:

Given the usefull comments (@Frank & @chinsoon12), I double checked on the provided solution. As pointed out, the ifelse can get slow (reasons in the comment) and that is why the vectorised solution is a better way to go:

dt1 <- data.table(col1 = paste0(rep(letters[1:26], times = 10000), rep(c(1:10000), each = 26)),
                  col2 = paste0(rep(letters[26:1], times = 10000), rep(c(1:10000), each = 26)),
                  col3 = rep(c(3,1,2,4,6,5,3,6,2,NA), times = 26000))

dt2 <- data.table(col1 = paste0(rep(letters[1:26], times = 10000), rep(c(1:10000), each = 26)),
                  col2 = paste0(rep(letters[26:1], times = 10000), rep(c(1:10000), each = 26)),
                  col3 = rep(c(10,NA,22,51,60,NA,32,NA,28,NA), times = 26000))
dt2alt <- copy(dt2)

microbenchmark::microbenchmark(
NoIfelse = dt2[is.na(col3), col3 := dt1[dt2[is.na(col3)], on = .(col1, col2), col3]],
Ifelse = dt2alt[dt1, on = .(col1, col2), col3 := ifelse(is.na(x.col3), i.col3, x.col3)]
)

#Unit: milliseconds
#     expr       min       lq     mean   median       uq      max neval cld
# NoIfelse  94.06635 100.2841 107.4140 103.7807 107.4006 154.7868   100  a 
#   Ifelse 219.94923 243.2651 254.3271 250.6967 266.8300 306.7213   100   b

identical(dt2, dt2alt) #TRUE

这篇关于匹配两个data.tables的行以填充data.table的子集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆