Finding unique vector elements in a list efficiently


Question

I have a list of numerical vectors, and I need to create a list containing only one copy of each vector. There isn't a list method for the identical function, so I wrote a function that checks every vector against every other.

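F1 <- function(x){
    # Indices of elements that repeat an earlier element
    to_remove <- c()
    for(i in seq_along(x)){
        for(j in seq_along(x)){
            # j > i keeps the first copy and flags only the later ones
            if(j > i && identical(x[[i]], x[[j]])) to_remove <- c(to_remove, j)
        }
    }
    if(is.null(to_remove)) x else x[-to_remove]
}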

The problem is that this function becomes very slow as the size of the input list x increases, partly because each of the two for loops allocates a large index vector. I'm hoping for a method that will run in under one minute for a list of length 1.5 million with vectors of length 15, but that might be optimistic.

Does anyone know a more efficient way of comparing each vector in a list with every other vector? The vectors themselves are guaranteed to be equal in length.

Sample output is shown below.

x = list(1:4, 1:4, 2:5, 3:6)
F1(x)
> list(1:4, 2:5, 3:6)

Recommended answer

As per @JoshuaUlrich and @thelatemail, ll[!duplicated(ll)] works just fine, and so should unique(ll).

I previously suggested a method using sapply, with the idea of not checking every element in the list. I deleted that answer, as I think using unique makes more sense.
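As a quick check of that claim, here is a minimal sketch that applies both forms to the sample input from the question. The variable names (dup, res_subset, res_unique) are only illustrative and are not part of the original answer.

```r
# Sample input from the question
x <- list(1:4, 1:4, 2:5, 3:6)

# duplicated() flags every element that is identical to an earlier one
dup <- duplicated(x)
print(dup)                      # FALSE TRUE FALSE FALSE

# Both forms keep the first copy of each vector
res_subset <- x[!dup]
res_unique <- unique(x)
stopifnot(identical(res_subset, res_unique))
```

Both results should match the output the question asks for, list(1:4, 2:5, 3:6).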

# Let's create some sample data
xx <- lapply(rep(100,15), sample)
ll <- as.list(sample(xx, 1000, replace = TRUE))
ll

Putting it through some benchmarks:

fun1 <- function(ll) {
  ll[c(TRUE, !sapply(2:length(ll), function(i) ll[i] %in% ll[1:(i-1)]))]
}

library(digest)  # provides digest(), used to hash each vector

fun2 <- function(ll) {
  ll[!duplicated(sapply(ll, digest))]
}

fun3 <- function(ll)  {
  ll[!duplicated(ll)]
}

fun4 <- function(ll)  {
  unique(ll)
}

# Make sure all four functions return the same result
all(identical(fun1(ll), fun2(ll)), identical(fun2(ll), fun3(ll)), 
    identical(fun3(ll), fun4(ll)), identical(fun4(ll), fun1(ll)))
# [1] TRUE


library(rbenchmark)

benchmark(digest=fun2(ll), duplicated=fun3(ll), unique=fun4(ll), replications=100, order="relative")[, c(1, 3:6)]

        test elapsed relative user.self sys.self
3     unique   0.048    1.000     0.049    0.000
2 duplicated   0.050    1.042     0.050    0.000
1     digest   8.427  175.563     8.415    0.038
# I took out fun1, since it ran extremely slowly when ll is large

Fastest option:

unique(ll)
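
The benchmark above uses a list of only 1000 vectors, while the question asks about 1.5 million vectors of length 15. A rough sketch for timing unique() on a larger list follows. The helper make_list and its parameters are hypothetical, the data is synthetic, and no timing is claimed here, since results depend on the machine and the R version.

```r
# Hypothetical helper: build n integer vectors of length k, drawn from a
# small pool so that duplicates are guaranteed to occur.
make_list <- function(n, k = 15, pool_size = 1000) {
  pool <- lapply(seq_len(pool_size), function(i) sample.int(100L, k))
  sample(pool, n, replace = TRUE)
}

set.seed(1)
big <- make_list(1e5)           # raise n towards 1.5e6 to match the question

# Time the de-duplication itself, not the data generation
timing <- system.time(res <- unique(big))
print(timing)

# At most pool_size distinct vectors can survive
stopifnot(length(res) <= 1000)
```

Starting with a smaller n and increasing it gives a feel for how the run time grows before committing to the full-size list.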
