Find duplicated rows with original
Problem description
I can get duplicated rows in R on a data.table dt using
dt[duplicated(dt, by=someColumns)]
However, I would like to get pairs of duplicated rows and the "non-duplicates". For example, consider dt:
col1, col2, col3
A B C1
A B C2
A B1 C1
Now, dt[duplicated(dt, by = c('col1', 'col2'))] will give me something along the lines of
col1, col2, col3
A B C2
I would like to get this together with the row that it did not choose as the duplicate, that is
col1, col2, col3
A B C1
A B C2
Speed comparison of answers:
> system.time(dt[duplicated(dt2, by = t) | duplicated(dt, by = t, fromLast = TRUE)])
user system elapsed
0.008 0.000 0.009
> system.time(dt[, .SD[.N > 1], by = t])
user system elapsed
77.555 0.100 77.703
Recommended answer
I believe this is essentially a duplicate of this question, though I can see how you may not have found it...
...here's an answer building off the logic outlined in the referenced question:
dt <- read.table(text = "col1 col2 col3
A B C1
A B C2
A B1 C1", header = TRUE, stringsAsFactors = FALSE)
idx <- duplicated(dt[, 1:2]) | duplicated(dt[, 1:2], fromLast = TRUE)
dt[idx, ]
#---
col1 col2 col3
1 A B C1
2 A B C2
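To see why combining the two calls works, it can help to look at each logical vector on its own. The following is a minimal sketch on the same three rows; the names df, key, fwd, bwd and result are only illustrative and do not come from the original answer.

```r
# Same three-row example as above, rebuilt so the snippet is self-contained.
df <- data.frame(
  col1 = c("A", "A", "A"),
  col2 = c("B", "B", "B1"),
  col3 = c("C1", "C2", "C1"),
  stringsAsFactors = FALSE
)

# Only the key columns decide what counts as a duplicate.
key <- df[, c("col1", "col2")]

fwd <- duplicated(key)                   # flags every copy after the first one
bwd <- duplicated(key, fromLast = TRUE)  # flags every copy before the last one

fwd        # FALSE  TRUE FALSE
bwd        #  TRUE FALSE FALSE
fwd | bwd  #  TRUE  TRUE FALSE

# A row is kept if either scan flagged it, so the first occurrence survives too.
result <- df[fwd | bwd, ]
```

Neither scan alone catches both rows of the pair; the union does, which is the whole point of adding fromLast = TRUE.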
Since you are using data.table, this is probably what you want:
library(data.table)
dt <- data.table(dt)
dt[duplicated(dt, by = c("col1", "col2")) | duplicated(dt, by = c("col1", "col2"), fromLast = TRUE)]
#---
col1 col2 col3
1: A B C1
2: A B C2
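If this is needed in several places, the expression can be wrapped in a small function. This is only a sketch: all_dups is a hypothetical helper name, not something data.table provides, and it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical helper (not part of data.table): return every row whose key
# columns occur more than once, including the first occurrence of each group.
all_dups <- function(x, cols) {
  # Compute the mask first so the subset below is a plain logical index.
  idx <- duplicated(x, by = cols) | duplicated(x, by = cols, fromLast = TRUE)
  x[idx]
}

dt <- data.table(
  col1 = c("A", "A", "A"),
  col2 = c("B", "B", "B1"),
  col3 = c("C1", "C2", "C1")
)

res <- all_dups(dt, c("col1", "col2"))
print(res)
```

On the example table this should return the two rows with col1 = A and col2 = B, in their original order.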