Find duplicates, compare a condition, erase one row (with NAs) in R


Question

I am building upon the question "find duplicate, compare a condition, erase one row r" to solve a more complicated case.

Using the following reproducible example:

ID1 <- c("a1", "a4", "a6", "a6", "a5", "a1", NA, "a3", "a2", "a2", "a8", "a9", "a9")
ID2 <- c("b8", "b99", "b5", "b5", "b2", "b8", "b7", "b7", "b6", "b6", NA, "b9", NA)
Value1 <- c(2, 5, 6, 6, 2, 7, NA, 5, NA, 4, 4, 6, 6)
Value2 <- c(23, 51, 63, 64, 23, 23, 5, 6, 4, NA, NA, 4, NA)
Year <- c(2004, 2004, 2004, 2004, 2005, 2004, 2008, 2009, 2008, 2009, 2014, 2016, 2016)
# stringsAsFactors = FALSE keeps the IDs as character (the default since R 4.0.0)
df <- data.frame(ID1, ID2, Value1, Value2, Year, stringsAsFactors = FALSE)

I want to select rows where ID1, ID2, and Year each have the same value. For these rows, I want to compare Value1 and Value2 across the duplicate rows and, if the values differ, erase the row with the higher value in that column (because of the data structure, this will be unambiguous).

Expected result:

#     ID1  ID2 Value1 Value2 Year
# 1    a1   b8      2     23 2004
# 2    a4  b99      5     51 2004
# 3    a6   b5      6     63 2004

# 5    a5   b2      2     23 2005

# 7  <NA>   b7     NA      5 2008
# 8    a3   b7      5      6 2009
# 9    a2   b6     NA      4 2008
# 10   a2   b6      4     NA 2009
# 11   a8 <NA>      4     NA 2014
# 12   a9   b9      6      4 2016

First solution:

df_new <- aggregate(.~ID1 + ID2 + Year, df, min, na.action = na.pass)

PROBLEM: it deletes rows when one of the IDs is NA.
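A minimal sketch of this behaviour on a hypothetical three-row data frame (`toy`, `id`, and `val` are illustrative names, not part of the data above). It suggests that `aggregate` omits rows whose grouping variable is NA even when `na.action = na.pass` is set:

```r
# Hypothetical toy data: the third row has a missing grouping key
toy <- data.frame(id = c("a", "a", NA), val = c(3, 1, 5),
                  stringsAsFactors = FALSE)

# na.pass keeps the NA row in the model frame, but aggregate still
# leaves out groups whose key is NA when it builds the result
res <- aggregate(val ~ id, toy, min, na.action = na.pass)
print(res)
```

Only the group `"a"` should appear in `res`; the row with the NA key is gone.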

I then changed the NAs to a character value:

df$ID1[is.na(df$ID1)] <- "Missing_data"
df$ID2[is.na(df$ID2)] <- "Missing_data"

df_new <- aggregate(.~ID1 + ID2 + Year, df, min, na.action = na.pass)

This solved the previous problem but created a second one.

PROBLEM: it leaves duplicate IDs when, within a single year, one row has NA and another row has the actual ID in one of the ID columns (the last two rows of df).

Recommended answer

Here is a dplyr solution:

library(dplyr)

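df %>%
  arrange(Value2) %>%                             # smaller Value2 first, NA last
  distinct(ID1, ID2, Year, .keep_all = TRUE) %>%  # keep the first row per ID1/ID2/Year
  arrange(ID2) %>%                                # NA ID2 sorts last
  distinct(ID1, Year, .keep_all = TRUE) %>%       # first row per ID1/Year: a known ID2 wins over NA
  arrange(ID1) %>%                                # NA ID1 sorts last
  distinct(ID2, Year, .keep_all = TRUE)           # first row per ID2/Year: a known ID1 wins over NA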

#      ID1  ID2 Value1 Value2 Year
# 1    a1   b8      2     23 2004
# 2    a2   b6     NA      4 2008
# 3    a2   b6      4     NA 2009
# 4    a3   b7      5      6 2009
# 5    a4  b99      5     51 2004
# 6    a5   b2      2     23 2005
# 7    a6   b5      6     63 2004
# 8    a8 <NA>      4     NA 2014
# 9    a9   b9      6      4 2016
# 10 <NA>   b7     NA      5 2008

When we arrange by Value2, the smaller values of Value2 will be on top, and distinct will remove any duplicates and keep the first row it finds (i.e. the one with the smaller Value2).
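As a small illustration of that "sort, then keep the first row" pattern, here is a sketch on a hypothetical toy data frame (`toy` and `key` are made-up names used only for this example):

```r
library(dplyr)

# Hypothetical toy data: key "k1" appears twice with different Value2
toy <- data.frame(key = c("k1", "k1", "k2"), Value2 = c(9, 3, 7),
                  stringsAsFactors = FALSE)

res <- toy %>%
  arrange(Value2) %>%               # smallest Value2 moves to the top
  distinct(key, .keep_all = TRUE)   # the first row per key survives
print(res)
```

For `k1`, the row with `Value2 = 3` is kept and the row with `Value2 = 9` is dropped.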

When we arrange by ID2 and then by ID1, the NA values will be on the bottom, so distinct will drop the NA rows when they duplicate a row with a known ID.
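The NA handling can be checked in isolation. This sketch uses a hypothetical two-row data frame (`toy`) that mimics the last two rows of the question's data, and relies on `arrange` sorting NA values last:

```r
library(dplyr)

# Hypothetical toy data: same ID1 and Year, but ID2 is missing in one row
toy <- data.frame(ID1 = c("a9", "a9"), ID2 = c(NA, "b9"),
                  Year = c(2016, 2016), stringsAsFactors = FALSE)

res <- toy %>%
  arrange(ID2) %>%                          # the NA in ID2 is sorted to the bottom
  distinct(ID1, Year, .keep_all = TRUE)     # the row with the known ID2 comes first and is kept
print(res)
```

Only the row with `ID2 = "b9"` remains.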

Note that I'm using only Value2 to keep the smaller values, as it's still not clear to me what you mean by "value".
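If "value" is meant to cover both columns, one possible variant (an assumption about the intent, not something the question confirms) is to sort by Value1 first and use Value2 only to break ties. A sketch on the question's data, with the result stored in a hypothetical `res`:

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  ID1 = c("a1", "a4", "a6", "a6", "a5", "a1", NA, "a3", "a2", "a2", "a8", "a9", "a9"),
  ID2 = c("b8", "b99", "b5", "b5", "b2", "b8", "b7", "b7", "b6", "b6", NA, "b9", NA),
  Value1 = c(2, 5, 6, 6, 2, 7, NA, 5, NA, 4, 4, 6, 6),
  Value2 = c(23, 51, 63, 64, 23, 23, 5, 6, 4, NA, NA, 4, NA),
  Year = c(2004, 2004, 2004, 2004, 2005, 2004, 2008, 2009, 2008, 2009, 2014, 2016, 2016),
  stringsAsFactors = FALSE
)

res <- df %>%
  arrange(Value1, Value2) %>%                     # assumption: Value1 decides, Value2 breaks ties
  distinct(ID1, ID2, Year, .keep_all = TRUE) %>%
  arrange(ID2) %>%
  distinct(ID1, Year, .keep_all = TRUE) %>%
  arrange(ID1) %>%
  distinct(ID2, Year, .keep_all = TRUE)
print(res)
```

On this data the variant keeps the same ten rows as the pipeline above. It could differ on data where the row with the smaller Value1 has the larger Value2, in which case Value1 takes precedence.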
