Checking for duplicates across two columns in R
Question
For example, my data set looks like this:
Var1 Var2 value
1 ABC BCD 0.5
2 DEF CDE 0.3
3 CDE DEF 0.3
4 BCD ABC 0.5
unique and duplicated do not detect rows 3 and 4 as duplicates, because Var1 and Var2 are swapped relative to rows 2 and 1.
Since my data set is quite large, is there an efficient way to keep only the unique rows? Like this:
Var1 Var2 value
1 ABC BCD 0.5
2 DEF CDE 0.3
For your convenience, you can use:
dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                  Var2 = c("BCD", "CDE", "DEF", "ABC"),
                  value = c(0.5, 0.3, 0.3, 0.5))
Also, if possible, is there a way to produce a distribution table for the top 20 values of Var1 (which has more than 10,000 levels)?
P.S. I have tried dat$count <- dat(as.character(dat$Var1))[as.character(dat$Var1)], but it just takes too long to run.
Recommended answer
Another option would be to sort columns Var1 and Var2 row-wise and then apply duplicated. After sorting, a pair such as (BCD, ABC) becomes (ABC, BCD), so swapped pairs compare as equal.
idx <- !duplicated(t(apply(dat[c("Var1", "Var2")], 1, sort)))
dat[idx, ]
# Var1 Var2 value
#1 ABC BCD 0.5
#2 DEF CDE 0.3
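Because apply loops over rows, the approach above may be slow on a very large data set. The following is a minimal sketch of a vectorized variant, not part of the original answer: it orders each pair with pmin and pmax instead of sorting row by row, and then counts Var1 values with table for the top-20 question. The names v1, v2, lo, hi, keep, dedup, and top20 are illustrative, and, like the answer above, it ignores the value column when deciding what counts as a duplicate.

```r
dat <- data.frame(Var1 = c("ABC", "DEF", "CDE", "BCD"),
                  Var2 = c("BCD", "CDE", "DEF", "ABC"),
                  value = c(0.5, 0.3, 0.3, 0.5))

# as.character() guards against factor columns
# (the data.frame default before R 4.0), which pmin/pmax cannot compare.
v1 <- as.character(dat$Var1)
v2 <- as.character(dat$Var2)

# Order each pair elementwise: lo holds the smaller string, hi the larger,
# so ("BCD", "ABC") and ("ABC", "BCD") map to the same (lo, hi) pair.
lo <- pmin(v1, v2)
hi <- pmax(v1, v2)

# Keep the first occurrence of each unordered pair.
keep <- !duplicated(data.frame(lo, hi))
dedup <- dat[keep, ]

# Frequency table of Var1, sorted by count, limited to the 20 most common values.
top20 <- head(sort(table(v1), decreasing = TRUE), 20)
```

Here dedup keeps rows 1 and 2 of the sample data. On the four-row sample, top20 has only four entries; with more than 20 distinct values it would be truncated to the 20 most frequent.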