How to determine duplicate rows where not all are the same in a column?


Problem description

Suppose I want to find duplicate rows based on these columns:

              cols<-c("col1", "col2")

I know that, for the data frame df4, the duplicate rows are:

      Jo<-df4[duplicated(df4[cols]) | duplicated(df4[cols], fromLast = TRUE), ]

and the data set with those duplicate rows removed is:

      No<-df4[!(duplicated(df4[cols]) | duplicated(df4[cols], fromLast = TRUE)), ]
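
For readers unfamiliar with the idiom, here is a minimal sketch of why the two duplicated() calls are combined. The vector x is a made-up example and is not part of df4.

```r
# Hypothetical toy vector, used only to illustrate the idiom
x <- c("a", "b", "a", "c", "a")

# Forward pass: flags every occurrence after the first one
fwd <- duplicated(x)
# Backward pass: flags every occurrence before the last one
bwd <- duplicated(x, fromLast = TRUE)

# The union marks every member of a repeated value, including the first
in_dup_group <- fwd | bwd
print(in_dup_group)
# TRUE FALSE TRUE FALSE TRUE
```

Either pass alone would leave one copy of each repeated value unflagged, which is why both appear in the expressions for Jo and No.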

I want to modify the code above. Suppose there is a column called mode that takes integer values from 1 to 4. A group of duplicate rows should not count as duplicates when every row in the group has mode == 2.

Example

          col1       col2        mode
            1          3           5
            5          3           9
            1          2           1
            1          2           1
            3          2           2
            3          2           2
            4          1           3
            4          1           2
            4          1           2

Output

          Jo:

          col1       col2        mode
            1          2           1
            1          2           1
            4          1           3
            4          1           2
            4          1           2

          No:

          col1       col2        mode
            1          3           5
            5          3           9
            3          2           2
            3          2           2

In the example above, the two rows with col1 = 3 and col2 = 2 both have mode == 2, so they are not treated as duplicates. The last three rows are treated as duplicates because one of them has a mode other than 2.

Recommended answer

Based on the updated dataset:

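library(dplyr)
# cols is c("col1", "col2"), as defined in the question
# Keep groups that have more than one row and are not entirely mode == 2
out1 <- df2 %>%
            group_by_at(vars(cols)) %>%
            filter(n() > 1, !all(mode == 2))

# Everything else: rows of df2 with no match in out1
out2 <- anti_join(df2, out1)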
out1
# A tibble: 5 x 3
# Groups:   col1, col2 [2]
#   col1  col2  mode
#  <int> <int> <int>
#1     1     2     1
#2     1     2     1
#3     4     1     3
#4     4     1     2
#5     4     1     2

out2
#  col1 col2 mode
#1    1    3    5
#2    5    3    9
#3    3    2    2
#4    3    2    2






Or using data.table:

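library(data.table)
# Row indices (.I) of groups with more than one row that are not all mode == 2
# Note: setDT() converts df2 to a data.table by reference
i1 <- setDT(df2)[ ,  .I[.N > 1 & !all(mode == 2)],  by = cols]$V1
df2[i1]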
#   col1 col2 mode
#1:    1    2    1
#2:    1    2    1
#3:    4    1    3
#4:    4    1    2
#5:    4    1    2

# A leading ! on i is data.table's not-select: all rows except those in i1
df2[!i1]
#   col1 col2 mode
#1:    1    3    5
#2:    5    3    9
#3:    3    2    2
#4:    3    2    2






Or using base R:

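# Assumes df2 is a plain data.frame; after setDT(df2), df2[1:2] would select rows, not columns
i1 <- duplicated(df2[1:2]) | duplicated(df2[1:2], fromLast = TRUE)
# Keep duplicated rows whose group is not entirely mode == 2
out11 <- df2[i1 & with(df2, !ave(mode == 2, col1, col2, FUN = all)), ]
# The remaining rows
out22 <- df2[setdiff(row.names(df2), row.names(out11)), ]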



Data

df2 <- structure(list(col1 = c(1L, 5L, 1L, 1L, 3L, 3L, 4L, 4L, 4L), 
    col2 = c(3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), mode = c(5L, 
    9L, 1L, 1L, 2L, 2L, 3L, 2L, 2L)), class = "data.frame", row.names = c(NA, 
-9L))
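
As a cross-check, the same split can be sketched as a small base R helper. This is only an illustration under stated assumptions: split_dups() and its arguments flag_col and flag_val are hypothetical names that do not appear in the original answer, and the sketch assumes the mode column has no missing values.

```r
# Same values as df2 above, rebuilt as a plain data.frame
df <- data.frame(
  col1 = c(1L, 5L, 1L, 1L, 3L, 3L, 4L, 4L, 4L),
  col2 = c(3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 1L),
  mode = c(5L, 9L, 1L, 1L, 2L, 2L, 3L, 2L, 2L)
)

# split_dups() is a hypothetical helper, not part of the original answer
split_dups <- function(data, cols, flag_col = "mode", flag_val = 2L) {
  # One group label per row, built from the key columns
  grp <- interaction(data[cols], drop = TRUE)
  # Size of each row's group
  n_in_grp <- ave(seq_len(nrow(data)), grp, FUN = length)
  # TRUE for every row whose group consists only of flag_val
  all_flag <- as.logical(ave(data[[flag_col]] == flag_val, grp, FUN = all))
  # Duplicated on the key columns, and not entirely flag_val
  keep <- n_in_grp > 1 & !all_flag
  list(Jo = data[keep, , drop = FALSE], No = data[!keep, , drop = FALSE])
}

res <- split_dups(df, c("col1", "col2"))
print(res$Jo)  # rows 3, 4, 7, 8, 9
print(res$No)  # rows 1, 2, 5, 6
```

Computing the group size and the all-equal flag once per group keeps the two conditions separate, so either one can be changed without touching the other.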
