Find duplicates, compare a condition, erase one row - with NAs (R)
Question
I am building upon the question "find duplicate, compare a condition, erase one row r" to solve a more complicated case.
Using the following reproducible example:
ID1<-c("a1","a4","a6","a6","a5", "a1",NA,"a3", "a2","a2", "a8", "a9", "a9")
ID2<-c("b8","b99","b5","b5","b2","b8" , "b7","b7", "b6","b6",NA,"b9",NA)
Value1<-c(2,5,6,6,2,7, NA,5,NA,4,4,6,6)
Value2<- c(23,51,63,64,23,23,5,6,4,NA,NA,4,NA)
Year<- c(2004,2004,2004,2004,2005,2004,2008,2009, 2008,2009,2014,2016,2016)
df<-data.frame(ID1,ID2,Value1,Value2,Year)
I want to select rows where ID1, ID2 and Year all have the same values in their respective columns. For these rows, I want to compare Value1 and Value2 across the duplicate rows and, if the values are not the same, erase the row with the higher value in that column (because of the data structure, this will be unambiguous).
Expected result:
# ID1 ID2 Value1 Value2 Year
# 1 a1 b8 2 23 2004
# 2 a4 b99 5 51 2004
# 3 a6 b5 6 63 2004
# 5 a5 b2 2 23 2005
# 7 <NA> b7 NA 5 2008
# 8 a3 b7 5 6 2009
# 9 a2 b6 NA 4 2008
# 10 a2 b6 4 NA 2009
# 11 a8 <NA> 4 NA 2014
# 12 a9 b9 6 4 2016
First solution:
df_new <- aggregate(.~ID1 + ID2 + Year, df, min, na.action = na.pass)
PROBLEM: it deletes rows when one of the IDs is NA.
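The row loss described above can be reproduced on a much smaller data frame. The sketch below uses a hypothetical `toy` data frame (not the question's `df`) and relies on the documented behaviour of `aggregate()`, which omits rows with missing values in any grouping variable even when `na.action = na.pass` keeps them in the model frame.

```r
# Hypothetical toy data: the third row has a missing grouping key.
toy <- data.frame(
  ID    = c("a", "a", NA),
  Year  = c(2004, 2004, 2008),
  Value = c(2, 7, 5)
)

# na.pass stops the formula interface from dropping rows with NAs,
# but aggregate() still omits groups whose grouping variable is NA.
res <- aggregate(Value ~ ID + Year, toy, min, na.action = na.pass)
print(res)
```

Only the ("a", 2004) group survives here; the row whose ID is NA is not in the result.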
I then changed the NAs to a character value:
df$ID1[is.na(df$ID1)] <- "Missing_data"
df$ID2[is.na(df$ID2)] <- "Missing_data"
df_new <- aggregate(.~ID1 + ID2 + Year, df, min, na.action = na.pass)
This solves the previous problem but creates a second one.
PROBLEM: duplicate IDs remain when, within a single year, one of the IDs is NA in one row and present in another (the last 2 rows of df).
Recommended answer
Here is a dplyr solution:
library(dplyr)
df %>%
  arrange(Value2) %>%
  distinct(ID1, ID2, Year, .keep_all = TRUE) %>%
  arrange(ID2) %>%
  distinct(ID1, Year, .keep_all = TRUE) %>%
  arrange(ID1) %>%
  distinct(ID2, Year, .keep_all = TRUE)
# ID1 ID2 Value1 Value2 Year
# 1 a1 b8 2 23 2004
# 2 a2 b6 NA 4 2008
# 3 a2 b6 4 NA 2009
# 4 a3 b7 5 6 2009
# 5 a4 b99 5 51 2004
# 6 a5 b2 2 23 2005
# 7 a6 b5 6 63 2004
# 8 a8 <NA> 4 NA 2014
# 9 a9 b9 6 4 2016
# 10 <NA> b7 NA 5 2008
When we arrange by Value2, the rows with the smaller Value2 will be on top, and distinct will remove any duplicates and keep the first row it finds (i.e. the one with the smaller Value2).
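To make the "sort, then keep the first row" idea concrete, here is a minimal sketch on a hypothetical three-row `toy` data frame (the column names mirror the question, but the data are made up for illustration).

```r
library(dplyr)

# Hypothetical toy data: two rows share ID and Year but differ in Value2.
toy <- data.frame(
  ID     = c("a6", "a6", "a5"),
  Year   = c(2004, 2004, 2005),
  Value2 = c(64, 63, 23)
)

kept <- toy %>%
  arrange(Value2) %>%                     # smaller Value2 first
  distinct(ID, Year, .keep_all = TRUE)    # keep the first row per (ID, Year)

print(kept)
```

For the duplicated (a6, 2004) pair, the row that survives is the one with Value2 equal to 63, because it is the first one distinct encounters after sorting.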
When we arrange by ID2 and then by ID1, the NA values will be at the bottom, so the distinct call that follows each sort will drop a row with an NA ID if it duplicates a row where that ID is present.
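The NA handling can be checked in isolation. The sketch below uses a hypothetical `toy` data frame modelled on the a8/a9 rows of the question; it assumes only that arrange sorts missing values last, which is dplyr's documented behaviour.

```r
library(dplyr)

# Hypothetical toy data: a9 appears in 2016 with and without ID2;
# a8 appears only with a missing ID2.
toy <- data.frame(
  ID1  = c("a9", "a9", "a8"),
  ID2  = c(NA, "b9", NA),
  Year = c(2016, 2016, 2014)
)

kept <- toy %>%
  arrange(ID2) %>%                        # rows with NA in ID2 go last
  distinct(ID1, Year, .keep_all = TRUE)   # first row per (ID1, Year) wins

print(kept)
```

The a9 row with ID2 present is kept and its NA twin is dropped, while the a8 row survives because it has no duplicate to lose against.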
Note that I'm using only Value2 to keep the smaller values, as it's still not clear to me what you mean by "value".
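If "value" is meant to cover both columns, one possible variant is to sort by Value1 first and break ties with Value2. This is only a sketch under that assumed reading of the question, not something the asker confirmed; the rest of the pipeline is unchanged.

```r
library(dplyr)

# The question's example data, rebuilt so the block is self-contained.
df <- data.frame(
  ID1    = c("a1","a4","a6","a6","a5","a1",NA,"a3","a2","a2","a8","a9","a9"),
  ID2    = c("b8","b99","b5","b5","b2","b8","b7","b7","b6","b6",NA,"b9",NA),
  Value1 = c(2,5,6,6,2,7,NA,5,NA,4,4,6,6),
  Value2 = c(23,51,63,64,23,23,5,6,4,NA,NA,4,NA),
  Year   = c(2004,2004,2004,2004,2005,2004,2008,2009,2008,2009,2014,2016,2016)
)

res <- df %>%
  arrange(Value1, Value2) %>%             # assumption: prefer smaller Value1, then smaller Value2
  distinct(ID1, ID2, Year, .keep_all = TRUE) %>%
  arrange(ID2) %>%
  distinct(ID1, Year, .keep_all = TRUE) %>%
  arrange(ID1) %>%
  distinct(ID2, Year, .keep_all = TRUE)

print(res)
```

With this ordering, the choice between the two (a1, b8, 2004) rows is decided explicitly by Value1 rather than by their original position in df.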