Select all rows which are duplicates except for one column

Question

I want to find rows in a dataset where the values in all columns, except for one, match. After much messing around, trying unsuccessfully to get duplicated() to return all instances of the duplicate rows (rather than only the repeats that follow the first instance), I figured out a way to do it (below).

For example, I want to identify all rows in the iris dataset that are equal except for Petal.Width.

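library(tidyverse)
# Drop the column to ignore, then keep the rows that repeat an earlier row
x = iris %>% select(-Petal.Width)
dups = x[duplicated(x), ]
# Keep every iris row whose remaining columns match one of those repeats
answer = iris %>% semi_join(dups, by = names(dups))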

> answer 
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1           5.1         3.5          1.4         0.2    setosa
2           4.9         3.1          1.5         0.1    setosa
3           4.8         3.0          1.4         0.1    setosa
4           5.1         3.5          1.4         0.3    setosa
5           4.9         3.1          1.5         0.2    setosa
6           4.8         3.0          1.4         0.3    setosa
7           5.8         2.7          5.1         1.9 virginica
8           6.7         3.3          5.7         2.1 virginica
9           6.4         2.8          5.6         2.1 virginica
10          6.4         2.8          5.6         2.2 virginica
11          5.8         2.7          5.1         1.9 virginica
12          6.7         3.3          5.7         2.5 virginica

As you can see, that works, but this is one of those times when I'm almost certain that lots of other folks need this functionality, and that I'm unaware of a single function that does this in fewer steps or in a generally tidier way. Any suggestions?

An alternative approach, suggested in at least two other posts, would look like this when applied to this case:

answer = iris[duplicated(iris[-4]) | duplicated(iris[-4], fromLast = TRUE),]

But that also seems like just a different workaround rather than a single function. Both approaches take the same amount of time (0.08 sec on my system). Is there no neater or faster way of doing this?

For example, something like iris %>% duplicates(all = TRUE, ignore = Petal.Width)

Recommended answer

iris[duplicated(iris[,-4]) | duplicated(iris[,-4], fromLast = TRUE),]

Among the rows that are duplicated when column 4 is ignored, duplicated(iris[,-4]) flags the second row of each duplicate set (rows 18, 35, 46, 133, 143 and 145), while duplicated(iris[,-4], fromLast = TRUE) flags the first row of each set (rows 1, 10, 13, 102, 125 and 129). Combining the two logical vectors with | yields 12 TRUEs, so the subset returns the expected output.
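The question asked for a single call along the lines of duplicates(all = TRUE, ignore = Petal.Width). No such function is named in this thread, but the base R expression above can be wrapped in a small helper. The sketch below is only an illustration: the function name all_duplicates and its ignore argument are hypothetical, not part of base R or dplyr.

```r
# Hypothetical helper (not a base R or dplyr function): returns every row of
# `data` that has at least one other row matching it on all columns except
# those named in `ignore`.
all_duplicates <- function(data, ignore = character()) {
  # Columns that take part in the comparison
  key <- data[, setdiff(names(data), ignore), drop = FALSE]
  # duplicated() marks the later rows of each set; fromLast = TRUE marks the
  # earlier ones. Their union covers every member of every duplicate set.
  keep <- duplicated(key) | duplicated(key, fromLast = TRUE)
  data[keep, , drop = FALSE]
}

res <- all_duplicates(iris, ignore = "Petal.Width")
print(rownames(res))
```

Because the columns are passed by name rather than by position, the call keeps working if the column order of the data frame changes.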

Or perhaps with dplyr: group on all variables except Petal.Width, count how often each combination occurs, and keep the rows whose combination occurs more than once.

library(dplyr)
iris %>% 
  # Group on every column except Petal.Width
  group_by_at(vars(-Petal.Width)) %>% 
  # Keep only groups that contain more than one row
  filter(n() > 1)

   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          <dbl>       <dbl>        <dbl>       <dbl>    <fctr>
 1          5.1         3.5          1.4         0.2    setosa
 2          4.9         3.1          1.5         0.1    setosa
 3          4.8         3.0          1.4         0.1    setosa
 4          5.1         3.5          1.4         0.3    setosa
 5          4.9         3.1          1.5         0.2    setosa
 6          4.8         3.0          1.4         0.3    setosa
 7          5.8         2.7          5.1         1.9 virginica
 8          6.7         3.3          5.7         2.1 virginica
 9          6.4         2.8          5.6         2.1 virginica
10          6.4         2.8          5.6         2.2 virginica
11          5.8         2.7          5.1         1.9 virginica
12          6.7         3.3          5.7         2.5 virginica
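In dplyr 1.0.0 and later, group_by_at() is superseded by across(). The following is a minimal sketch of the same idea in the newer spelling, assuming dplyr 1.0.0 or later is installed. It also calls ungroup() so that later steps see an ungrouped result, and compares the rows it keeps with the base R approach. The object names dup_dplyr and dup_base are chosen here for illustration.

```r
library(dplyr)

# Group on every column except Petal.Width, keep groups with more than one
# row, then drop the grouping again.
dup_dplyr <- iris %>%
  group_by(across(-Petal.Width)) %>%
  filter(n() > 1) %>%
  ungroup()

# Base R result for comparison (column 4 is Petal.Width)
key <- iris[, -4]
dup_base <- iris[duplicated(key) | duplicated(key, fromLast = TRUE), ]

# Both approaches should keep the same rows in the same order
print(nrow(dup_dplyr) == nrow(dup_base))
print(identical(dup_dplyr$Petal.Width, dup_base$Petal.Width))
```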
