R data.table remove rows where one column is duplicated if another column is NA

Question

Here is an example data.table.

library(data.table)
dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'), col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

   col1   col2
1:    A     NA
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

I want to remove rows where col2 is NA if the same col1 value also appears in another row with a non-NA col2. In other words: group by col1, and if a group has more than one row and one of them is NA, remove the NA row. This would be the result for dt:

   col1   col2
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

I tried this:

dt[, list(col2 = ifelse(length(col1>1), col2[!is.na(col2)], col2)), by=col1]

   col1 col2
1:    A  dog
2:    B  cat
3:    C jeep
4:    D   NA

What am I missing? Thanks.

Recommended answer
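
As for what the attempt in the question is missing, the following is a minimal sketch of one likely explanation, not part of the original answer. `length(col1 > 1)` measures the length of a logical vector rather than testing whether the group has more than one row, so it is treated as TRUE for every group. `ifelse()` then returns a result shaped like its length-one test, which keeps only the first non-NA value. The name `fixed` is illustrative only.

```r
library(data.table)

dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'),
                 col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

# ifelse() returns a value the same length as its test, so a length-one
# test keeps only the first element of the "yes" vector.
first_only <- ifelse(TRUE, c('jeep', 'porsch'), 'other')

# Using plain if/else per group returns the whole vector instead:
# keep the non-NA values when the group has any, otherwise keep the group as is.
fixed <- dt[, .(col2 = if (any(!is.na(col2))) col2[!is.na(col2)] else col2),
            by = col1]
print(fixed)
```

This variant only returns col1 and col2, so any other columns in a wider table would need to be carried along separately.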

An attempt to find all the NA cases in groups where there is also a non-NA value, and then remove those rows:

dt[-dt[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1]
#   col1   col2
#1:    A    dog
#2:    B    cat
#3:    C   jeep
#4:    C porsch
#5:    D     NA
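
The one-liner above can be unpacked into two steps, sketched here with the illustrative names `drop_idx` and `result` (not from the original answer). `.I` holds the row numbers of the current group within the full table, so the inner call collects the positions of the rows to drop. The guard on an empty index is an added precaution: as with base R subsetting, negating an empty integer vector yields an empty index, which would select no rows rather than all of them.

```r
library(data.table)

dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'),
                 col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

# Step 1: per group, pick row numbers where col2 is NA
# AND the same group also holds at least one non-NA value.
drop_idx <- dt[, .I[any(!is.na(col2)) & is.na(col2)], by = col1]$V1

# Step 2: drop those rows, but only if there is something to drop.
result <- if (length(drop_idx) > 0) dt[-drop_idx] else dt
print(result)
```

Because rows are removed by position, any additional columns in the table are kept untouched.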

This seems quicker than the na.omit-based alternative timed second below, though I'm sure someone is going to turn up with an even quicker version shortly:

set.seed(1)
dt2 <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
system.time(dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
#   user  system elapsed 
#   1.49    0.02    1.51 
system.time(dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
#   user  system elapsed 
#   4.49    0.04    4.54 
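
When reading these timings, it may help to keep in mind that the two expressions are not strictly interchangeable. The sketch below uses a small hypothetical table `dt3` (not from the original post) with a group whose rows are all NA: the index-based approach keeps every row of that group, while the grouped approach collapses it to a single NA row. `NA_character_` replaces `NA_integer_` here only because `dt3$col2` is character.

```r
library(data.table)

# Hypothetical data: group 'D' has two rows, both NA.
dt3 <- data.table(col1 = c('D', 'D', 'E'),
                  col2 = c(NA, NA, 'x'))

# Index-based approach, guarded against an empty set of rows to drop.
idx <- dt3[, .I[any(!is.na(col2)) & is.na(col2)], by = col1]$V1
by_index <- if (length(idx) > 0) dt3[-idx] else dt3

# Grouped approach: an all-NA group is reduced to one NA row.
by_group <- dt3[, .(col2 = if (all(is.na(col2))) NA_character_ else na.omit(col2)),
                by = col1]

print(nrow(by_index))
print(nrow(by_group))
```

Which behaviour is preferable depends on whether duplicate all-NA rows should survive, something the question does not specify.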
