Join only those columns that are non-NA
Question
I have a dataset with NAs in some rows for some columns:
library(data.table)
DT <- data.table(ID=c(1, 2, 1:3), A=c(NA, NA, 1, NA, 3), B=c(4, 5, NA, 5, 6), C=c(7, 8, NA, NA, 9))
DT
# ID A B C
# 1: 1 NA 4 7
# 2: 2 NA 5 8
# 3: 1 1 NA NA
# 4: 2 NA 5 NA
# 5: 3 3 6 9
and a reference table:
ref <- data.table(ID=c(1, 1:3), A=c(1, 1:3), B=c(1, 4:6), C=c(1, 7, NA, 9), VAL=c(111, 101:103), VAL2=c(112, 104:106))
ref
# ID A B C VAL VAL2
# 1: 1 1 1 1 111 112
# 2: 1 1 4 7 101 104
# 3: 2 2 5 NA 102 105
# 4: 3 3 6 9 103 106
Question: How do I left join DT with ref using only the non-NA columns of each row?
Desired output:
ID A B C VAL VAL2
1: 1 NA 4 7 101 104
2: 2 NA 5 8 NA NA
3: 1 1 NA NA 111 112
4: 1 1 NA NA 101 104
5: 2 NA 5 NA 102 105
6: 3 3 6 9 103 106
I tried to do it row by row as follows:
newcols <- c("VAL", "VAL2")
resLs <- lapply(split(DT, seq_len(nrow(DT))), function(x) {
    # x is a one-row table; find its non-NA columns
    nonNACols <- names(x)[sapply(x, Negate(is.na))]
    # left join with ref after subsetting ref to the join columns plus the new columns
    ref[, c(nonNACols, newcols), with=FALSE][x, on=nonNACols]
})
#combine the list of row results
ans <- rbindlist(resLs, use.names=TRUE, fill=TRUE)
setcolorder(ans, names(ref))
ans
It would be better if the join could be done per group of rows rather than row by row. Any suggestions?
Edit: finally nailed it after many hours, using data.table with grouping:
cols <- c("ID", "A", "B", "C")
newcols <- c("VAL", "VAL2")
# label each row with the names of its non-NA columns
DT[, grp := paste(names(.SD)[sapply(.SD, Negate(is.na))], collapse=""), by=seq_len(nrow(DT)), .SDcols=cols]
rbindlist(
    DT[, {
        # columns that are not entirely NA within this group
        vec <- names(.SD)[sapply(.SD, function(x) !all(is.na(x)))]
        list(list(ref[.SD, on=vec,
                      c(vec, newcols), with=FALSE]))
    }, by=.(grp)]$V1,
    use.names=TRUE, fill=TRUE)
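To make the grouping step easier to follow, here is a small sketch that prints the key computed for each row of the example DT. It uses "_" as the separator purely for readability; the copy `DT_demo` is a hypothetical name introduced only for this illustration.

```r
library(data.table)

DT <- data.table(ID=c(1, 2, 1:3), A=c(NA, NA, 1, NA, 3),
                 B=c(4, 5, NA, 5, 6), C=c(7, 8, NA, NA, 9))
cols <- c("ID", "A", "B", "C")

# work on a copy (hypothetical name) so the original DT is left untouched
DT_demo <- copy(DT)

# for each row, paste together the names of the columns that are not NA
DT_demo[, grp := paste(names(.SD)[sapply(.SD, Negate(is.na))], collapse="_"),
        by=seq_len(nrow(DT_demo)), .SDcols=cols]

DT_demo$grp
# "ID_B_C" "ID_B_C" "ID_A" "ID_B" "ID_A_B_C"
```

Rows that share a key can be joined to ref in one go, because they all use the same set of join columns.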
Edit: another way to code it:
cols <- c("ID", "A", "B", "C")
newcols <- c("VAL", "VAL2")
DT[, grp := paste(names(.SD)[sapply(.SD, Negate(is.na))], collapse="_"),
   by=seq_len(nrow(DT)),
   .SDcols=cols]
setnames(DT[,
            ref[.SD, on=strsplit(.BY$grp, split="_")[[1L]],
                c(paste0("i.", cols), paste0("x.", newcols)), with=FALSE],
            by=.(grp)][, -1L],
         c(cols, newcols))[]
Recommended answer
One option would be to match on something like A = A OR is.na(A), and likewise for the other columns. However, I don't think you can use OR conditions when merging data.tables. For complicated merge situations like these, I like to use sqldf instead:
library(sqldf)
sqldf("SELECT l.*, r.VAL, r.VAL2
FROM DT as l
LEFT JOIN ref as r
ON l.ID = r.ID AND (l.A = r.A OR l.A IS NULL)
AND (l.B = r.B OR l.B IS NULL)
AND (l.C = r.C OR l.C IS NULL)
AND (l.A IS NOT NULL OR l.B IS NOT NULL OR l.C IS NOT NULL)")
# ID A B C VAL VAL2
#1 1 NA 4 7 101 104
#2 2 NA 5 8 NA NA
#3 1 1 NA NA 111 112
#4 1 1 NA NA 101 104
#5 2 NA 5 NA 102 105
#6 3 3 6 9 103 106
Note that the last condition ensures that a DT row whose A, B and C are all NA will not match any rows of ref.
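For readers who prefer to stay within data.table, the same "equal OR NA" logic can be approximated by joining on ID alone and filtering the candidate pairs afterwards. This is only a sketch under the assumption that ID is never NA; the helper `ok`, the row id `rid`, and the `ref_` prefix are hypothetical names introduced for the illustration.

```r
library(data.table)

DT  <- data.table(ID=c(1, 2, 1:3), A=c(NA, NA, 1, NA, 3),
                  B=c(4, 5, NA, 5, 6), C=c(7, 8, NA, NA, 9))
ref <- data.table(ID=c(1, 1:3), A=c(1, 1:3), B=c(1, 4:6), C=c(1, 7, NA, 9),
                  VAL=c(111, 101:103), VAL2=c(112, 104:106))
keys <- c("A", "B", "C")

# copies with helper columns: a row id for DT, prefixed key names for ref
DT2 <- copy(DT)[, rid := .I]
r   <- setnames(copy(ref), keys, paste0("ref_", keys))

# candidate pairs: every DT row against every ref row with the same ID
cand <- merge(DT2, r, by="ID", allow.cartesian=TRUE)

# a key matches if the DT value is NA, or both values are present and equal
ok <- function(l, r) is.na(l) | (!is.na(r) & l == r)
hit <- cand[ok(A, ref_A) & ok(B, ref_B) & ok(C, ref_C) &
            !(is.na(A) & is.na(B) & is.na(C))]

# left join back onto DT so rows without any match keep NA in VAL and VAL2
res <- hit[DT2, on="rid", .(ID=i.ID, A=i.A, B=i.B, C=i.C, VAL, VAL2)]
res
```

The intermediate table grows with the number of rows sharing an ID, so this approach suits small or moderately sized inputs better than the grouped join.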