Is there a more elegant way to find duplicated records?
Question
I've got 81,000 records in my test frame, and duplicated() shows me that 2039 are identical matches. One answer to "Find duplicated rows (based on 2 columns) in Data Frame in R" suggests a method for creating a smaller frame of just the duplicate records. This works for me, too:
dup <- data.frame(as.numeric(duplicated(df$var))) # creates df with binary var for duplicated rows
colnames(dup) <- c("dup")                         # renames column for simplicity
df2 <- cbind(df, dup)                             # bind to original df
df3 <- subset(df2, dup == 1)                      # subsets df using binary var for duplicates
But, as the poster noted, it seems inelegant. Is there a cleaner way to get the same result: a view of just those records that are duplicates?
In my case I'm working with scraped data, and I need to figure out whether the duplicates exist in the original or were introduced by my scraping.
Answer
duplicated(df)
will give you a logical vector (TRUE/FALSE values), which you can then use as an index into your data frame's rows.
# indx will contain TRUE values wherever in df$var there is a duplicate
indx <- duplicated(df$var)
df[indx, ] #note the comma
You can put it all together in one line:
df[duplicated(df$var), ] # again, the comma, to indicate we are selected rows
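As a minimal sketch of how this behaves (the data frame and its var column here are hypothetical stand-ins for the question's data): duplicated() only flags the second and later occurrences of a value, so if you also need the first copies, e.g. to compare scraped duplicates against their originals, you can combine it with a fromLast = TRUE pass.

```r
# Hypothetical stand-in data; "var" plays the role of the column being checked
df <- data.frame(id = 1:6, var = c("a", "b", "a", "c", "b", "d"))

# Rows flagged as duplicates (second and later occurrences only)
dups <- df[duplicated(df$var), ]

# Every row involved in a duplicate, first occurrences included:
# a forward pass catches the repeats, a backward pass catches the originals
all_dups <- df[duplicated(df$var) | duplicated(df$var, fromLast = TRUE), ]
```

With the sample data above, dups contains the two repeated rows, while all_dups also pulls in the first occurrences of "a" and "b", which is handy when you need to inspect both copies side by side.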