Want to remove duplicated rows unless NA value exists in columns

Views: 56 | Published: 2021/4/23 20:46:07 | Tags: r, duplicates, data.table, conditional-statements, distinct

Problem description

I have a data table with 4 columns: ID, Name, Rate1, Rate2.

I want to remove duplicates where ID, Rate1, and Rate2 are the same, but if both rates are NA, I would like to keep both rows.

Basically, I want to conditionally remove duplicates, but only when the rate values are not NA.

For example, I would like this:

ID   Name   Rate1    Rate2
1    Xyz    1        2
1    Abc    1        2
2    Def    NA       NA
2    Lmn    NA       NA
3    Hij    3        5
3    Qrs    3        7

to become this:

ID   Name   Rate1    Rate2
1    Xyz    1        2
2    Def    NA       NA
2    Lmn    NA       NA
3    Hij    3        5
3    Qrs    3        7

Thanks in advance!

EDIT: I know it's possible to just take a subset of the data table where the Rates are NA, then remove duplicates on what's left, then add the NA rows back in - but, I would rather avoid this strategy. This is because in reality there are quite a few couplets of rates that I want to do this for consecutively.

EDIT2: Added in some more rows to the example for clarity.

Solution

A base R option is to call duplicated on the dataset without the 'Name' column (column index 2) to get a logical vector, then negate it with ! (TRUE becomes FALSE and vice versa) so that TRUE marks the non-duplicated rows. A second condition uses rowSums on the logical matrix is.na(df1[3:4]) (the Rate columns) to find the rows where every rate is NA; the row sum is compared with 2, the number of Rate columns in the dataset. The two conditions are joined with | to create the expected logical index:

i1 <- !duplicated(df1[-2]) | rowSums(is.na(df1[3:4])) == 2
df1[i1, ]
#  ID Name Rate1 Rate2
#1  1  Xyz     1     2
#3  2  Def    NA    NA
#4  2  Lmn    NA    NA
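
To see how the two conditions combine, the index can be built step by step. This is only a minimal sketch on the df1 sample from this answer; the intermediate names not_dup, all_na and keep are introduced here for illustration and are not part of the original code.

```r
# Sample data (same values as df1 in the data section)
df1 <- data.frame(
  ID = c(1L, 1L, 2L, 2L),
  Name = c("Xyz", "Abc", "Def", "Lmn"),
  Rate1 = c(1L, 1L, NA, NA),
  Rate2 = c(2L, 2L, NA, NA)
)

# TRUE for the first occurrence of each (ID, Rate1, Rate2) combination.
# duplicated() treats NA as equal to NA, so row 4 counts as a duplicate of row 3.
not_dup <- !duplicated(df1[-2])   # TRUE FALSE TRUE FALSE

# TRUE where both rate columns are NA
all_na <- rowSums(is.na(df1[3:4])) == 2   # FALSE FALSE TRUE TRUE

# A row is kept if either condition holds
keep <- not_dup | all_na   # TRUE FALSE TRUE TRUE

df1[keep, ]
```

Row 2 is the only one dropped: it repeats row 1 and its rates are not NA. Row 4 also repeats row 3, but it is rescued by the all-NA condition.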

Or with Reduce from base R:

df1[Reduce(`&`, lapply(df1[3:4], is.na)) | !duplicated(df1[-2]), ]
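
The Reduce version folds the per-column is.na vectors together with &, so the result is TRUE only where every rate column is NA. The following is a small sketch, assuming the df1 sample from this answer, that checks it flags the same rows as the rowSums condition; the names na_by_col, all_na_reduce and all_na_rowsums are made up for the illustration.

```r
# Sample data (same values as df1 in the data section)
df1 <- data.frame(
  ID = c(1L, 1L, 2L, 2L),
  Name = c("Xyz", "Abc", "Def", "Lmn"),
  Rate1 = c(1L, 1L, NA, NA),
  Rate2 = c(2L, 2L, NA, NA)
)

# One logical vector per rate column
na_by_col <- lapply(df1[3:4], is.na)

# Element-wise AND across all of those vectors
all_na_reduce <- Reduce(`&`, na_by_col)

# The rowSums formulation from the first solution
all_na_rowsums <- rowSums(is.na(df1[3:4])) == 2

# Both approaches flag the same rows
all(all_na_reduce == all_na_rowsums)
```

Unlike the rowSums form, the Reduce form does not need the number of rate columns to be written out.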

Wrapping it in a function:

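f1 <- function(dat, i, method) {
    # names of the rate columns (Rate1, Rate2, ...)
    nm1 <- grep("^Rate", colnames(dat), value = TRUE)
    # TRUE for the first occurrence of each row, ignoring column i
    i1 <- !duplicated(dat[-i])
    # TRUE where every rate column is NA, computed by the chosen method
    i2 <- switch(method,
        "rowSums" = rowSums(is.na(dat[nm1])) == length(nm1),
        "Reduce" = Reduce(`&`, lapply(dat[nm1], is.na))
    )
    # keep rows that are not duplicated or whose rates are all NA
    i3 <- i1 | i2
    dat[i3, ]
}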

Testing:

f1(df1, 2, "rowSums")
#  ID Name Rate1 Rate2
#1  1  Xyz     1     2
#3  2  Def    NA    NA
#4  2  Lmn    NA    NA

f1(df1, 2, "Reduce")
#  ID Name Rate1 Rate2
#1  1  Xyz     1     2
#3  2  Def    NA    NA
#4  2  Lmn    NA    NA

f1(df2, 2, "rowSums")
#  ID Name Rate1 Rate2
#1  1  Xyz     1     2
#3  2  Def    NA    NA
#4  2  Lmn    NA    NA
#5  3  Hij     3     5
#6  3  Qrs     3     7

f1(df2, 2, "Reduce")
#  ID Name Rate1 Rate2
#1  1  Xyz     1     2
#3  2  Def    NA    NA
#4  2  Lmn    NA    NA
#5  3  Hij     3     5
#6  3  Qrs     3     7

If there are multiple 'Rate' columns (say 100 or more), the only things to change in the first solution are the column index 3:4 and the 2, which should be replaced by the number of 'Rate' columns.
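
One way to avoid hard-coding the 2 and the 3:4 index is to look the rate columns up by name. The sketch below assumes that every rate column name starts with "Rate" and that Name is the only column to ignore when checking for duplicates; the df3 data, the Rate3 column and the helper names rate_cols and key_cols are invented for this illustration.

```r
# Hypothetical data with three rate columns
df3 <- data.frame(
  ID = c(1L, 1L, 2L, 2L, 3L, 3L),
  Name = c("Xyz", "Abc", "Def", "Lmn", "Hij", "Qrs"),
  Rate1 = c(1L, 1L, NA, NA, 3L, 3L),
  Rate2 = c(2L, 2L, NA, NA, NA, NA),
  Rate3 = c(9L, 9L, NA, NA, 5L, 5L)
)

# Rate columns found by name instead of by position
rate_cols <- grep("^Rate", names(df3), value = TRUE)

# Columns that define a duplicate: everything except Name
key_cols <- setdiff(names(df3), "Name")

not_dup <- !duplicated(df3[key_cols])
# Compare against the number of rate columns rather than a literal 2
all_na <- rowSums(is.na(df3[rate_cols])) == length(rate_cols)

res <- df3[not_dup | all_na, ]
res
```

With this data the result keeps Xyz, Def, Lmn and Hij. Qrs is dropped because it repeats Hij and only one of its rates is NA, which matches the "all rates are NA" rule used in this answer.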

Or using tidyverse

library(tidyverse)
df1 %>%
    group_by(ID) %>%
    filter_at(vars(Rate1, Rate2), any_vars(!duplicated(.)|is.na(.)))
# A tibble: 3 x 4
# Groups:   ID [2]
#     ID Name  Rate1 Rate2
#  <int> <chr> <int> <int>
#1     1 Xyz       1     2
#2     2 Def      NA    NA
#3     2 Lmn      NA    NA



df2 %>% 
     group_by(ID) %>%
     filter_at(vars(Rate1, Rate2), any_vars(!duplicated(.)|is.na(.)))
# A tibble: 5 x 4
# Groups:   ID [3]
#     ID Name  Rate1 Rate2
#  <int> <chr> <int> <int>
#1     1 Xyz       1     2
#2     2 Def      NA    NA
#3     2 Lmn      NA    NA
#4     3 Hij       3     5
#5     3 Qrs       3     7

Data:

df1 <- structure(list(ID = c(1L, 1L, 2L, 2L), Name = c("Xyz", "Abc", 
"Def", "Lmn"), Rate1 = c(1L, 1L, NA, NA), Rate2 = c(2L, 2L, NA, 
 NA)), class = "data.frame", row.names = c(NA, -4L))

df2 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L), Name = c("Xyz", 
 "Abc", "Def", "Lmn", "Hij", "Qrs"), Rate1 = c(1L, 1L, NA, NA, 
 3L, 3L), Rate2 = c(2L, 2L, NA, NA, 5L, 7L)), class = "data.frame", 
 row.names = c(NA, -6L))
