Fastest way to remove all duplicates in R

Posted: 2017/7/20 23:22:10 | Tags: r, performance, duplicates, unique

Problem description

I'd like to remove all items that appear more than once in a vector. Specifically, this includes character, numeric and integer vectors. Currently, I'm using duplicated() both forwards and backwards (using the fromLast parameter).

Is there a more computationally efficient (faster) way to do this in R? The solution below is simple enough to write and read, but it seems inefficient to run the duplicate search twice. Perhaps a counting-based method using an additional data structure would be better?

Example:

d <- c(1,2,3,4,1,5,6,4,2,1)
d[!(duplicated(d) | duplicated(d, fromLast=TRUE))]
#[1] 3 5 6

Recommended answer

Some timings:

set.seed(1001)
d <- sample(1:100000, 100000, replace=T)
d <- c(d, sample(d, 20000, replace=T))  # ensure many duplicates
mb <- microbenchmark::microbenchmark(
  d[!(duplicated(d) | duplicated(d, fromLast=TRUE))],
  setdiff(d, d[duplicated(d)]),
  {tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
  as.integer(names(table(d)[table(d)==1])),
  d[!(duplicated.default(d) | duplicated.default(d, fromLast=TRUE))],
  d[!(d %in% d[duplicated(d)])],
  { ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
  d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))]
)
summary(mb)[, c(1, 4)]  # in milliseconds
#                                                                                expr      mean
#1                               d[!(duplicated(d) | duplicated(d, fromLast = TRUE))]  18.34692
#2                                                       setdiff(d, d[duplicated(d)])  24.84984
#3                       {     tmp <- rle(sort(d))     tmp$values[tmp$lengths == 1] }   9.53831
#4                                         as.integer(names(table(d)[table(d) == 1])) 255.76300
#5               d[!(duplicated.default(d) | duplicated.default(d, fromLast = TRUE))]  18.35360
#6                                                      d[!(d %in% d[duplicated(d)])]  24.01009
#7                        {     ud = unique(d)     ud[tabulate(match(d, ud)) == 1L] }  32.10166
#8 d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d,      F, T, NA)))]  18.33475
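
The timings alone do not show that the candidates differ in what they return. As a minimal sketch (the helper names `singles_dup` and `singles_rle` are hypothetical, introduced here only for illustration), the duplicated()-based approach keeps the original order, while the faster sort-based approach returns its result in sorted order and, because sort() drops NA by default, discards missing values:

```r
# Hypothetical helper names, not part of the original answer.

# Keep elements that occur exactly once, preserving their original order.
singles_dup <- function(x) {
  x[!(duplicated(x) | duplicated(x, fromLast = TRUE))]
}

# Sort-based variant: run-length encode the sorted vector and keep runs of
# length 1. The result is in sorted order, and sort() removes NA by default.
singles_rle <- function(x) {
  r <- rle(sort(x))
  r$values[r$lengths == 1]
}

x <- c(5, 1, 3, 1, 2)
singles_dup(x)  # original order: 5 3 2
singles_rle(x)  # sorted order:   2 3 5
```

If the order of the surviving elements matters, the sort-based speedup may not be a drop-in replacement for the original one-liner.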

Checking that they all give the correct result on the small example vector from the question:

d <- c(1,2,3,4,1,5,6,4,2,1)  # reset d to the example vector; its non-repeated items are 3, 5, 6
results <- list(d[!(duplicated(d) | duplicated(d, fromLast=TRUE))],
         setdiff(d, d[duplicated(d)]),
         {tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
         as.integer(names(table(d)[table(d)==1])),
         d[!(duplicated.default(d) | duplicated.default(d, fromLast=TRUE))],
         d[!(d %in% d[duplicated(d)])],
         { ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
         d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))])
all(sapply(results, all.equal, c(3, 5, 6)))
# TRUE
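
The question asks about character, numeric and integer vectors, but the table()-based candidate converts the names back with as.integer(), so it only suits integer-valued input. A counting-based version that should work for all three types can be sketched from the unique()/match()/tabulate() candidate above; the function name `singles_count` is hypothetical, and the `nbins` argument is added here as a precaution rather than taken from the original answer:

```r
# Hypothetical helper name, used only for illustration.
# Counting-based approach: map each element to the index of its unique value,
# count how often each index occurs, and keep the values seen exactly once.
singles_count <- function(x) {
  ux <- unique(x)
  ux[tabulate(match(x, ux), nbins = length(ux)) == 1L]
}

singles_count(c("a", "b", "a", "c"))  # "b" "c"
singles_count(c(1.5, 2.5, 1.5))       # 2.5
singles_count(c(1L, 2L, 2L, 3L))      # 1 3
```

In the timings above this counting approach was slower than the two duplicated() passes, so it is shown for its generality, not as the fastest option.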
