快速检查一个向量是否是另一个向量的一部分 [英] Fast check whether one vector is part of another vector
问题描述
我有一个向量列表,比如 vecs = list(vec1=1:3, vec2=1:5, vec3=2:6, vec4=1:7)
.我想删除所有列表成员,如果它包含在另一个列表成员中.例如,vecs$vec1
是 vecs$vec2
(或 vecs$vec4
)的一部分,所以我想删除它.>
我想要一个非常快的实现,因为 length(vecs)
非常大.我所做的是首先按长度对 vecs
成员进行排序 vecs = vecs[ order(unlist(lapply(vecs, length))) ]
,然后对于 check_member = vecs[i]
,检查它是否是vecs[ i+1 ], vecs[ i+2 ]...
的一部分.有没有更好的策略?完整代码:
vecs = list(vec1=1:3, vec2=1:5, vec3=2:6, vec4=1:7, vec5=2:3)vecs = vecs[ order(unlist(lapply(vecs, length))) ] ##按成员长度排序vecs_len = 长度(vecs)toRemove = numeric(vecs_len) ##记录是否移除该成员for( i in 1:(vecs_len-1 ))for( j in (i+1):(vecs_len )){if(length(setdiff(vecs[[i]],vecs[[j]]))==0) {toRemove[i] = 1;break} ##检查 vecs[[i]] 是否是 vecs[[j]] 的一部分}vecs = vecs[!toRemove]
请在您的大数据上试试这个.正如您在评论中所阐明的那样,我从字符向量开始.我还假设向量不包含重复项(很容易通过 lapply(vecs, unique)
修复)
vecs <- list(vec1=letters[1:3],vec2=字母[1:5],vec3=字母[2:6],vec4=字母[1:7])
首先,将您的数据转换为共享相同级别的因子(即整数)列表:
vlevels <- unique(unlist(vecs))nlev <- 长度(vlevels)fvecs <- lapply(vecs, factor, levels = vlevels)
然后,我将数据转换为包含/不包含的 0
和 1
矩阵:
vtabs <- vapply(fvecs, tabulate, integer(nlev), nbins = nlev)# vec1 vec2 vec3 vec4# [1,] 1 1 0 1# [2,] 1 1 1 1# [3,] 1 1 1 1# [4,] 0 1 1 1# [5,] 0 1 1 1# [6,] 0 0 1 1# [7,] 0 0 0 1
接下来,我计算叉积以查看两个向量共有多少个元素,并与每个向量的长度进行比较以确定第一个向量是否是第二个向量的子集.
num.match <- crossprod(vtabs)vlen <- sapply(vecs, 长度)is.subset.mat <- num.match == vlendiag(is.subset.mat) <- FALSE# vec1 vec2 vec3 vec4# vec1 FALSE TRUE FALSE TRUE# vec2 假假假真# vec3 假假假真# vec4 假假假假
最后,我逐行求和以判断每个向量是否是其他向量的子集:
is.subset <- rowSums(is.subset.mat) >0升# vec1 vec2 vec3 vec4# 真真真假
你只剩下过滤:
out <- vecs[!is.subset]
如果您的初始列表太大以至于速度太慢,那么我建议您逐块运行它,然后再运行结果.
I have a list of vectors, say vecs = list(vec1=1:3, vec2=1:5, vec3=2:6, vec4=1:7)
. I want to remove all the list members that if it is contained in another list member. For example, vecs$vec1
is part of vecs$vec2
, (or vecs$vec4
), so I want to remove it.
I want a very fast implementation, because length(vecs)
is very big. What I did is first sort vecs
member by length vecs = vecs[ order(unlist(lapply(vecs, length))) ]
, then for check_member = vecs[i]
, check whether it is part of vecs[ i+1 ], vecs[ i+2 ]...
.Is there any better strategy? Complete code:
vecs = list(vec1=1:3, vec2=1:5, vec3=2:6, vec4=1:7, vec5=2:3)
vecs = vecs[ order(unlist(lapply(vecs, length))) ] ##sort by member length
vecs_len = length(vecs)
toRemove = numeric(vecs_len ) ##record whether to remove this member
for( i in 1:(vecs_len-1 ))
for( j in (i+1):(vecs_len ))
{
if( length( setdiff(vecs[[i]],vecs[[j]]) )==0 ) {toRemove[i] = 1; break} ##check whether vecs[[i]] is part of vecs[[j]]
}
vecs = vecs[!toRemove]
Please try this on your large data. I am starting from vectors of characters as you clarified in your comment. I am also assuming the vectors do not contain duplicates (easy to fix via lapply(vecs, unique)
if they do)
vecs <- list(vec1=letters[1:3],
vec2=letters[1:5],
vec3=letters[2:6],
vec4=letters[1:7])
First, turn your data into a list of factors (i.e. integers) sharing the same levels:
vlevels <- unique(unlist(vecs))
nlev <- length(vlevels)
fvecs <- lapply(vecs, factor, levels = vlevels)
Then, I convert the data into a matrix of 0
and 1
for included/not-included:
vtabs <- vapply(fvecs, tabulate, integer(nlev), nbins = nlev)
# vec1 vec2 vec3 vec4
# [1,] 1 1 0 1
# [2,] 1 1 1 1
# [3,] 1 1 1 1
# [4,] 0 1 1 1
# [5,] 0 1 1 1
# [6,] 0 0 1 1
# [7,] 0 0 0 1
Next, I compute the crossproduct to see how many elements two vectors have in common, and compare with the length of each vector to decide whether the first one is a subset of the second one.
num.match <- crossprod(vtabs)
vlen <- sapply(vecs, length)
is.subset.mat <- num.match == vlen
diag(is.subset.mat) <- FALSE
# vec1 vec2 vec3 vec4
# vec1 FALSE TRUE FALSE TRUE
# vec2 FALSE FALSE FALSE TRUE
# vec3 FALSE FALSE FALSE TRUE
# vec4 FALSE FALSE FALSE FALSE
Finally, I sum row-wise to tell if each vector is a subset of any other:
is.subset <- rowSums(is.subset.mat) > 0L
# vec1 vec2 vec3 vec4
# TRUE TRUE TRUE FALSE
You are only left with filtering:
out <- vecs[!is.subset]
If your initial list is so large that this is too slow, then I would suggest that you run it chunk-by-chunk, then again on the results.
这篇关于快速检查一个向量是否是另一个向量的一部分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!