Find duplicate values in R


Problem description

I have a table with 21638 unique* rows:

vocabulary <- read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Vocabulary.txt", header = TRUE)

This table has five columns, the first of which holds the respondent ID numbers. I want to check whether any respondent appears twice, or whether all respondents are unique.

To count unique IDs I can use

length(unique(vocabulary$id))

and to check whether there are any duplicates I might do

length(unique(vocabulary$id)) == nrow(vocabulary)

which returns TRUE if there are no duplicates (and here there are none).

My question:

Is there a direct way to return the values or line numbers of duplicates?

Some further explanation:

There is an interpretation problem with the function duplicated(), because it only returns the duplicates in the strict sense, excluding the "originals". For example, sum(duplicated(vocabulary$id)) or dim(vocabulary[duplicated(vocabulary$id),])[1] might return "5" as the number of duplicate rows. The problem is that if you only know the number of duplicates, you won't know how many rows they duplicate. Does "5" mean that there are five rows with one duplicate each, or that there is one row with five duplicates? And since you won't have the IDs or line numbers of the duplicates, you have no means of finding the "originals".
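The ambiguity described above can be reproduced on a small scale. The sketch below uses two hypothetical ID vectors (`ids_pairs` and `ids_run` are made-up names and values, not taken from the Vocabulary data) that give the same duplicate count even though their structure differs:

```r
# Hypothetical toy ID vectors, not taken from the Vocabulary data.
ids_pairs <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)  # five IDs, each appearing twice
ids_run   <- c(7, 7, 7, 7, 7, 7)              # one ID appearing six times

# duplicated() flags every occurrence after the first one,
# so both vectors yield the same count of flagged elements.
count_pairs <- sum(duplicated(ids_pairs))
count_run   <- sum(duplicated(ids_run))
print(c(count_pairs, count_run))

# The distinct values behind those counts differ, which the bare count hides.
values_pairs <- unique(ids_pairs[duplicated(ids_pairs)])
values_run   <- unique(ids_run[duplicated(ids_run)])
print(values_pairs)
print(values_run)
```

Both counts come out equal here, so the number alone cannot distinguish "five IDs repeated once" from "one ID repeated five times".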

*I know there are no duplicate IDs in this survey, but it is a good example, because any of the answers given elsewhere to this question, such as duplicated(vocabulary$id) or table(vocabulary$id), will print a haystack to your screen in which you will be quite unable to find any rare duplicate needles.

Recommended answer

You could use table, i.e.

n_occur <- data.frame(table(vocabulary$id))

gives you a data frame with a list of ids and the number of times each occurred.

n_occur[n_occur$Freq > 1,]

tells you which ids occurred more than once.

vocabulary[vocabulary$id %in% n_occur$Var1[n_occur$Freq > 1],]

returns the records with more than one occurrence.
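Since the Vocabulary data contains no duplicates, the steps are easier to follow on a small example. The following is a minimal sketch on a hypothetical data frame (`toy` and its `score` column are made up for illustration); it also shows one possible alternative based on duplicated() with its `fromLast` argument, which is not part of the answer above:

```r
# Hypothetical toy data; names and values are assumptions for illustration.
toy <- data.frame(id    = c(101, 102, 103, 102, 104, 101, 102),
                  score = c(5, 7, 6, 7, 9, 4, 8))

# Count occurrences of each id; data.frame(table(...)) yields columns Var1 and Freq.
n_occur <- data.frame(table(toy$id))

# ids that occur more than once, together with their counts.
dup_ids <- n_occur[n_occur$Freq > 1, ]

# All rows (originals and repeats) belonging to those ids.
dup_rows <- toy[toy$id %in% dup_ids$Var1, ]

# Row numbers of those records in the original table.
dup_row_numbers <- which(toy$id %in% dup_ids$Var1)

# Alternative sketch: flag repeats scanning forwards and backwards,
# so that the "originals" are included as well.
alt_row_numbers <- which(duplicated(toy$id) | duplicated(toy$id, fromLast = TRUE))

print(dup_ids)
print(dup_rows)
print(dup_row_numbers)
```

The `Freq` column answers the "how many times" question directly, and which() supplies the line numbers the question asks for.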
