匹配两个data.tables的行以填充data.table的子集 [英] match rows of two data.tables to fill subset of a data.table
问题描述
我有两个数据表:
> DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c")
, col2 = c("b","d","c","a","d","a","c","a")
, col3 = c(1,2,3,4,5,6,7,8))
> DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a")
, col2 = c("d","b","c","d","a","a","c","a")
, col3 = c(NA,1,2,NA,6,NA,3,NA))
> DT1
col1 col2 col3
1: a b 1
2: b d 2
3: b c 3
4: a a 4
5: c d 5
6: b a 6
7: a c 7
8: c a 8
> DT2
col1 col2 col3
1: b d NA
2: e b 1
3: c c 2
4: e d NA
5: b a 6
6: c a NA
7: d c 3
8: a a NA
我想使用col1和col2将col3为NA的DT2行与DT1的行进行匹配,如果存在匹配,用DT1中的行填充DT2中col3的NA值.
I'd like to match the rows of DT2 for which col3 is NA to those of DT1, using col1 and col2 and fill the NA values of col3 in DT2 with those in DT1, if match exists.
> #desired Output
> DT2_output
col1 col2 col3
1: b d 2
2: e b 1
3: c c 2
4: e d NA
5: b a 6
6: c a 8
7: d c 3
8: a a 4
我如何使用简洁的data.table操作(无循环)来执行此操作,因为每个data.table中有数百万行. 我尝试了以下操作,但它给了我错误,我认为这与what语句有关.
how do I do this using succinct data.table operation (no loops), as there are millions of rows in each data.table. I tried the following and it gives me error, which I believe has something to do with the which statements.
> ##doesn't work
> DT2[is.na(col3), col3 := DT1[which(col1 == DT2[is.na(col3),col1] && col2 == DT2[is.na(col3), col2]), col3]]
推荐答案
我可以直接进行左连接,然后基于是否缺少原始col3进行ifelse条件,如下所示:
I could do directly a left join and then ifelse condition based on if the original col3 is missing like follows:
DT2new <- merge(DT2, DT1, by = c("col1", "col2"), all.x = T)
DT2new[, col3 := ifelse(is.na(col3.x), col3.y, col3.x)]
DT2new <- DT2new[, .(col1, col2, col3)]
# col1 col2 col3
#1: a a 4
#2: b a 6
#3: b d 2
#4: c a 8
#5: c c 2
#6: d c 3
#7: e b 1
#8: e d NA
或者,一种更有效的方法是通过引用进行操作,该操作直接修改(通过引用代替)DT2数据.表:
Alternatively, a more efficient way is to do the operation by reference, which directly modifies (in place by reference) the DT2 data.table :
DT2[DT1, on = .(col1, col2), col3 := i.col3]
# col1 col2 col3
#1: b d 2
#2: e b 1
#3: c c 2
#4: e d NA
#5: b a 6
#6: c a 8
#7: d c 3
#8: a a 4
i.col3
中的i
引用了DT[i, j, by]
中的i,因此引用了DT1
中的列col3
.之所以可行,是因为在任何情况下,两个data.tables中的col1和col2列都不会丢失.如果遇到这种情况,您可以执行以下更通用的操作(包括data.tables DT1和DT2的示例):
The i
in i.col3
refers to the i as in DT[i, j, by]
, hence to the column col3
in DT1
. This works because you do not have any cases when a non missing match for columns col1 and col2 in both data.tables would occur. If you have such cases you could do a more generic operation as follows (including examples of data.tables DT1 and DT2):
DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c", "e"),
col2 = c("b","d","c","a","d","a","c","a", "b"),
col3 = c(1,2,3,4,5,6,7,8, 22))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
col2 = c("d","b","c","d","a","a","c","a"),
col3 = c(NA,1,2,NA,6,NA,3,NA))
您可以看到DT1
的col1 = "e"
和col2 = "b"
值为22.DT2
的col1 = "e"
和col2 = "b"
值为1.为了优先使用
You can see that DT1
has for col1 = "e"
and col2 = "b"
a value of 22. The DT2
has for col1 = "e"
and col2 = "b"
a value of 1. In order to give preference to DT2
if such clash occurs you would do:
DT2[DT1, on = .(col1, col2), col3 := ifelse(is.na(x.col3), i.col3, x.col3)]
哪个给你
# col1 col2 col3
#1: b d 2
#2: e b 1
#3: c c 2
#4: e d NA
#5: b a 6
#6: c a 8
#7: d c 3
#8: a a 4
x.col3
中的x.
指的是DT2
中的col3
.
矢量化data.table操作方法是一种更有效的通用方法(如果DT1自己包含一对col1和col2对的值)
A more efficient generic method is the vectorised data.table operation approach (in case that DT1 contains values for a certain pair of col1 and col2 on its own)
鉴于有用的评论(@Frank& @ chinsoon12),我再次检查了提供的解决方案.正如指出的那样,ifelse可能会变慢(注释中的原因),这就是为什么向量化解决方案是更好的选择的原因:
Given the usefull comments (@Frank & @chinsoon12), I double checked on the provided solution. As pointed out, the ifelse can get slow (reasons in the comment) and that is why the vectorised solution is a better way to go:
dt1 <- data.table(col1 = paste0(rep(letters[1:26], times = 10000), rep(c(1:10000), each = 26)),
col2 = paste0(rep(letters[26:1], times = 10000), rep(c(1:10000), each = 26)),
col3 = rep(c(3,1,2,4,6,5,3,6,2,NA), times = 26000))
dt2 <- data.table(col1 = paste0(rep(letters[1:26], times = 10000), rep(c(1:10000), each = 26)),
col2 = paste0(rep(letters[26:1], times = 10000), rep(c(1:10000), each = 26)),
col3 = rep(c(10,NA,22,51,60,NA,32,NA,28,NA), times = 26000))
dt2alt <- copy(dt2)
microbenchmark::microbenchmark(
NoIfelse = dt2[is.na(col3), col3 := dt1[dt2[is.na(col3)], on = .(col1, col2), col3]],
Ifelse = dt2alt[dt1, on = .(col1, col2), col3 := ifelse(is.na(x.col3), i.col3, x.col3)]
)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# NoIfelse 94.06635 100.2841 107.4140 103.7807 107.4006 154.7868 100 a
# Ifelse 219.94923 243.2651 254.3271 250.6967 266.8300 306.7213 100 b
identical(dt2, dt2alt) #TRUE
这篇关于匹配两个data.tables的行以填充data.table的子集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!