Join only those columns that are non-NA

Question

I have a dataset with NAs in some rows for some columns:

library(data.table)
DT <- data.table(ID=c(1, 2, 1:3), A=c(NA, NA, 1, NA, 3), B=c(4, 5, NA, 5, 6), C=c(7, 8, NA, NA, 9))
DT
#    ID  A  B  C
# 1:  1 NA  4  7
# 2:  2 NA  5  8
# 3:  1  1 NA NA
# 4:  2 NA  5 NA
# 5:  3  3  6  9

and a reference table:

ref <- data.table(ID=c(1, 1:3), A=c(1, 1:3), B=c(1, 4:6), C=c(1, 7, NA, 9), VAL=c(111, 101:103), VAL2=c(112, 104:106))
ref
#    ID A B  C VAL VAL2
# 1:  1 1 1  1 111  112
# 2:  1 1 4  7 101  104
# 3:  2 2 5 NA 102  105
# 4:  3 3 6  9 103  106

Question: How do I left join DT with ref, using only the columns that are non-NA in each row of DT?

Desired output (blank lines added to emphasize the grouping):

   ID  A  B  C VAL VAL2
1:  1 NA  4  7 101  104

2:  2 NA  5  8  NA   NA

3:  1  1 NA NA 111  112
4:  1  1 NA NA 101  104

5:  2 NA  5 NA 102  105

6:  3  3  6  9 103  106

I tried to do it row by row as follows:

newcols <- c("VAL", "VAL2")
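resLs <- lapply(seq_len(nrow(DT)), function(i) {
    #take one row at a time (ID is not unique in DT, so splitting by ID would not be row by row)
    x <- DT[i]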
    #find those non-NA columns
    nonNACols <- names(x)[sapply(x, Negate(is.na))]

    #left join with ref table after subsetting the columns of ref table
    ref[, c(nonNACols, newcols), with=FALSE][x, on=nonNACols]
})

#combine the list of row results
ans <- rbindlist(resLs, use.names=TRUE, fill=TRUE)
setcolorder(ans, names(ref))
ans

It would be better if the solution could work on groups of rows rather than row by row. Any suggestions?

Edit: finally nailed it after many hours, using data.table with grouping:

cols <- c("ID","A", "B", "C")
newcols <- c("VAL", "VAL2")
DT[, grp := paste(names(.SD)[sapply(.SD, Negate(is.na))], collapse=""), by=seq_len(nrow(DT)), .SDcols=cols]

rbindlist(
    DT[, {
        vec <- names(.SD)[sapply(.SD, function(x) !all(is.na(x)))]
        list(list(ref[.SD, on=vec,
            c(vec, newcols), with=FALSE]))
    }, by=.(grp)]$V1,
    use.names=TRUE, fill=TRUE)
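
To make the grouping idea above easier to follow, here is a minimal sketch of the key that each row of DT would receive. It is not part of the original code: `present` and `grp_preview` are hypothetical names, and it assumes the "_" separator used in the second variant.

```r
library(data.table)

DT <- data.table(ID=c(1, 2, 1:3), A=c(NA, NA, 1, NA, 3), B=c(4, 5, NA, 5, 6), C=c(7, 8, NA, NA, 9))
cols <- c("ID", "A", "B", "C")

# TRUE where a cell is non-NA: one row per DT row, one column per key column
present <- !is.na(as.matrix(DT[, cols, with=FALSE]))

# grp_preview (hypothetical name): the non-NA column names of each row, pasted together
grp_preview <- apply(present, 1L, function(keep) paste(cols[keep], collapse="_"))

# row numbers of DT that share a key and can therefore be joined to ref in one call
split(seq_len(nrow(DT)), grp_preview)
```

Rows with the same pattern of non-NA columns end up in the same group, so ref only has to be joined once per pattern instead of once per row.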

Edit: another way to code it:

cols <- c("ID","A", "B", "C")
newcols <- c("VAL", "VAL2")
DT[, grp := paste(names(.SD)[sapply(.SD, Negate(is.na))], collapse="_"),
    by=seq_len(nrow(DT)),
    .SDcols=cols]

setnames(DT[,
    ref[.SD, on=strsplit(.BY$grp, split="_")[[1L]], 
        c(paste0("i.", cols), paste0("x.",newcols)), with=FALSE], 
    by=.(grp)][,-1L], 
    c(cols, newcols))[]


Recommended answer

One option would be to match on something like A = A OR is.na(A), etc. However, I don't think you can use OR conditions to merge data.tables. For complicated merge situations like these, I like to use sqldf instead:

library(sqldf)
sqldf("SELECT l.*, r.VAL, r.VAL2
       FROM       DT as l
       LEFT JOIN  ref as r
       ON         l.ID = r.ID AND (l.A = r.A OR l.A IS NULL)
                  AND (l.B = r.B OR l.B IS NULL)
                  AND (l.C = r.C OR l.C IS NULL)
                  AND (l.A IS NOT NULL OR l.B IS NOT NULL OR l.C IS NOT NULL)")

#  ID  A  B  C VAL VAL2
#1  1 NA  4  7 101  104
#2  2 NA  5  8  NA   NA
#3  1  1 NA NA 111  112
#4  1  1 NA NA 101  104
#5  2 NA  5 NA 102  105
#6  3  3  6  9 103  106

Note that the last condition ensures that a row of DT in which A, B, and C are all NA won't match any rows of ref.
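
For readers who would rather stay in data.table, the same "equal or NA" logic can be approximated by joining on ID only and filtering the candidate pairs afterwards. This is a sketch under assumptions rather than a true OR-join: `rid`, `rA`/`rB`/`rC`, and `same_or_na` are hypothetical helper names, and the approach materializes every ID match before filtering, so it may be slow on large tables.

```r
library(data.table)

DT  <- data.table(ID=c(1, 2, 1:3), A=c(NA, NA, 1, NA, 3), B=c(4, 5, NA, 5, 6), C=c(7, 8, NA, NA, 9))
ref <- data.table(ID=c(1, 1:3), A=c(1, 1:3), B=c(1, 4:6), C=c(1, 7, NA, 9),
                  VAL=c(111, 101:103), VAL2=c(112, 104:106))

# work on copies; rid (hypothetical) remembers the original row of DT
DTi  <- copy(DT)[, rid := .I]
refr <- setnames(copy(ref), c("A", "B", "C"), c("rA", "rB", "rC"))

# candidate pairs: every DT row with every ref row that has the same ID
cand <- refr[DTi, on="ID", allow.cartesian=TRUE]

# TRUE when the DT value is NA, or when both values are present and equal
same_or_na <- function(l, r) is.na(l) | (!is.na(r) & l == r)

# keep matching pairs, excluding DT rows where A, B and C are all NA
hits <- cand[same_or_na(A, rA) & same_or_na(B, rB) & same_or_na(C, rC) &
             !(is.na(A) & is.na(B) & is.na(C)),
             .(rid, VAL, VAL2)]

# join back to DT so that unmatched rows keep NA in VAL and VAL2
res <- hits[DTi, on="rid"][, .(ID, A, B, C, VAL, VAL2)]
res
```

On the sample data this should return the same six rows as the sqldf query, with the unmatched row (ID 2, B 5, C 8) keeping NA in both VAL and VAL2.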
