Find duplicated rows with original


Problem description

I can get duplicated rows in R on a data.table dt using

dt[duplicated(dt, by=someColumns)] 

However, I would like to get the duplicated rows together with their "non-duplicate" counterparts. For example, consider dt:

col1, col2, col3 
   A     B    C1
   A     B    C2
   A    B1    C1

Now, dt[duplicated(dt, by = c('col1', 'col2'))] will give me something along the lines of

col1, col2, col3
   A     B    C2

I would like to get this together with the row that was not flagged as the duplicate, that is:

col1, col2, col3 
   A     B    C1
   A     B    C2

Speed comparison of the answers

> system.time(dt[duplicated(dt2, by = t) | duplicated(dt, by = t, fromLast = TRUE)])
   user  system elapsed 
  0.008   0.000   0.009 
> system.time(dt[, .SD[.N > 1], by = t])
   user  system elapsed 
 77.555   0.100  77.703 
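
The timings above can be reproduced in spirit with a small synthetic table. This is only a sketch: the table `big`, its columns `k1`, `k2`, `v`, and the size `n` are made up for illustration, and `keys` stands in for `t`, which is assumed here to be a character vector of column names. Absolute timings depend on the machine and on the number of groups.

```r
library(data.table)

set.seed(1)
n <- 2000L  # hypothetical size, kept small so the grouped version finishes quickly

# Hypothetical data: two key columns and a unique row id
big <- data.table(
  k1 = sample(500L, n, replace = TRUE),
  k2 = sample(3L, n, replace = TRUE),
  v  = seq_len(n)
)
keys <- c("k1", "k2")  # plays the role of `t` in the timings above

# Approach 1: flag duplicates scanning from both ends, then subset once
t_dup <- system.time(
  res_dup <- big[duplicated(big, by = keys) |
                 duplicated(big, by = keys, fromLast = TRUE)]
)

# Approach 2: subset .SD inside every group (one subset per group)
t_grp <- system.time(
  res_grp <- big[, .SD[.N > 1], by = keys]
)

# Both approaches should select the same rows, possibly in a different order
same_rows <- identical(sort(res_dup$v), sort(res_grp$v))
print(same_rows)
print(t_dup["elapsed"])
print(t_grp["elapsed"])
```

The grouped version evaluates `.SD[.N > 1]` once per group, so its cost grows with the number of groups, whereas the `duplicated()` version builds one logical vector and subsets once.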


Recommended answer

I believe this is essentially a duplicate of this question, though I can see how you may not have found it...

...here's an answer building off the logic outlined in the referenced question:

dt <- read.table(text = "col1 col2 col3 
   A     B    C1
   A     B    C2
   A    B1    C1", header = TRUE, stringsAsFactors = FALSE)


idx <- duplicated(dt[, 1:2]) | duplicated(dt[, 1:2], fromLast = TRUE)

dt[idx, ]
#---
  col1 col2 col3
1    A    B   C1
2    A    B   C2

Since you are using data.table, this is probably what you want:

library(data.table)
dt <- data.table(dt)
dt[duplicated(dt, by = c("col1", "col2")) | duplicated(dt, by = c("col1", "col2"), fromLast = TRUE)]
#---
   col1 col2 col3
1:    A    B   C1
2:    A    B   C2
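
If the same pattern is needed in several places, it can be wrapped in a small function. The sketch below is one possible way to do that; `all_duplicated_rows` is a hypothetical helper name, not something provided by data.table, and it only relies on `duplicated()` with the `by` and `fromLast` arguments used above.

```r
library(data.table)

# Hypothetical helper (not part of data.table): returns every row whose
# values in `cols` occur more than once, including the first occurrence.
all_duplicated_rows <- function(x, cols) {
  stopifnot(is.data.table(x), all(cols %in% names(x)))
  # Scanning forward flags the later copies, scanning backward flags the earlier ones
  idx <- duplicated(x, by = cols) | duplicated(x, by = cols, fromLast = TRUE)
  x[idx]
}

# Same example data as in the question
dt <- data.table(
  col1 = c("A", "A", "A"),
  col2 = c("B", "B", "B1"),
  col3 = c("C1", "C2", "C1")
)

res <- all_duplicated_rows(dt, c("col1", "col2"))
print(res)
```

On the example data this selects the two rows with col1 = A and col2 = B and keeps them in their original order.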
