Keep all elements in one list from another
Problem description
I have two large lists, train and keep, with the latter containing unique elements, e.g.
train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]
Is there a way, using sets, to create a new list that has all the elements of train that are in keep? The end result should be:
train_keep = [1, 3, 4, 3, 1]
Currently I'm using itertools.filterfalse, from "how to keep elements of a list based on another list", but it is very slow because the lists are large.
Recommended answer
Convert the list keep into a set, since it will be checked frequently. Iterate over train, since you want to keep order and repeats. That rules out converting train to a set. Even if it did not, a set would not help, since the iteration would have to happen anyway:
keeps = set(keep)
train_keep = [k for k in train if k in keeps]
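As a minimal runnable sketch of the set-based approach, here it is applied to the sample data from the question. The function name filter_keep is made up for illustration and is not part of the original answer.

```python
def filter_keep(train, keep):
    """Return the elements of train that appear in keep, preserving order and repeats."""
    keeps = set(keep)  # average-case O(1) membership tests
    return [k for k in train if k in keeps]


train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]
train_keep = filter_keep(train, keep)
print(train_keep)  # [1, 3, 4, 3, 1]
```

Building the set once, outside the comprehension, matters: calling set(keep) inside the condition would rebuild it for every element of train.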
A lazier, and probably slower, version would be something like
train_keep = filter(lambda x: x in keeps, train)
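To illustrate what "lazy" means here: in Python 3, filter returns an iterator, so nothing is computed until the result is consumed. The sketch below uses the question's sample data; passing keeps.__contains__ instead of a lambda is a substitution made for this example, not something from the original answer.

```python
train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keeps = {1, 3, 4}

# filter() returns an iterator; no element has been tested yet.
lazy = filter(keeps.__contains__, train)

first = next(lazy)  # tests elements only until the first match is found
rest = list(lazy)   # consumes the remaining matches

print(first, rest)  # 1 [3, 4, 3, 1]
```

If a real list is needed, wrap the call in list(...); the iterator can only be consumed once.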
Neither of these options will give you a large speedup. You'd probably be better off using numpy or pandas, or some other library that implements the loops in C and stores numbers as something simpler than full-blown Python objects. Here is a sample numpy solution:
import numpy as np

train = np.array([...])
keep = np.array([...])
# Boolean mask: True where the element of train is also in keep
train_keep = train[np.isin(train, keep)]
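For concreteness, the same np.isin approach can be run on the sample data from the question. This is only a sketch with the placeholder arrays filled in from the question's example.

```python
import numpy as np

train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])

# mask[i] is True when train[i] occurs somewhere in keep
mask = np.isin(train, keep)
train_keep = train[mask]  # boolean indexing keeps order and repeats

print(train_keep.tolist())  # [1, 3, 4, 3, 1]
```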
With M the size of train and N the size of keep, this is likely an O(M * N) algorithm rather than the O(M) of the set-lookup version, but if checking N elements in keep is faster than a nominally O(1) lookup, you win.
You can get something closer to O(M log(N)) using sorted lookup:
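train = np.array([...])
keep = np.array([...])
keep.sort()
# Binary search: index in keep where each element of train would be inserted
ind = np.searchsorted(keep, train, side='left')
# Elements larger than everything in keep get index keep.size; clamp them in bounds
ind[ind == keep.size] -= 1
# An element is kept only if the value found at its index equals it
train_keep = train[keep[ind] == train]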
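A self-contained sketch of the sorted-lookup idea on the question's sample data follows. The function name filter_sorted is hypothetical, and the sketch assumes keep is non-empty (with an empty keep the clamped index would be out of bounds).

```python
import numpy as np


def filter_sorted(train, keep):
    """Keep elements of train found in keep, using binary search on sorted keep."""
    keep = np.sort(keep)  # sorted copy; the caller's array is left untouched
    # Leftmost insertion index of each train element in the sorted keep
    ind = np.searchsorted(keep, train, side='left')
    # Values above keep's maximum get index keep.size; clamp to the last slot
    ind[ind == keep.size] -= 1
    # A train element is present exactly when the value at its index equals it
    return train[keep[ind] == train]


train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([4, 1, 3])  # deliberately unsorted
train_keep = filter_sorted(train, keep)
print(train_keep.tolist())  # [1, 3, 4, 3, 1]
```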
A better alternative might be to append np.inf or a maximum out-of-bounds integer to the sorted keep array, so you don't need the extra step that distinguishes missing elements from edge elements at all. Something like max(train.max() + 1, keep.max()) would do (note that this is Python's built-in max; np.max would interpret its second argument as an axis):
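train = np.array([...])
keep = np.array([...])
# Append a sentinel that is >= every element of train, so every index stays in bounds
keep = np.append(keep, max(train.max() + 1, keep.max()))
keep.sort()
ind = np.searchsorted(keep, train, side='left')
train_keep = train[keep[ind] == train]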
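The sentinel variant can be sketched as a small function and checked on the edge cases it is meant to handle. The name filter_sentinel is made up for this example, and the sketch assumes integer arrays where train.max() + 1 does not overflow the dtype.

```python
import numpy as np


def filter_sentinel(train, keep):
    """Sorted lookup with a sentinel appended, so no index clamping is needed."""
    # The sentinel is >= every element of train, so searchsorted never
    # returns an index past the end of the padded array.
    sentinel = max(train.max() + 1, keep.max())
    padded = np.sort(np.append(keep, sentinel))
    ind = np.searchsorted(padded, train, side='left')
    return train[padded[ind] == train]


train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])
train_keep = filter_sentinel(train, keep)
print(train_keep.tolist())  # [1, 3, 4, 3, 1]
```

When keep.max() already exceeds every element of train, the sentinel merely duplicates keep's largest value, which does not change the result.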
For random inputs with train.size = 10000 and keep.size = 10, the numpy method is ~10x faster on my laptop.
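A comparison along these lines can be reproduced with timeit. The sketch below is not the original benchmark: the value range (0 to 99), the seed, and the repeat count are assumptions chosen for illustration, and the measured ratio will vary by machine and library version.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)                   # seed chosen arbitrarily
train = rng.integers(0, 100, size=10_000)        # assumed value range 0..99
keep = rng.choice(100, size=10, replace=False)   # 10 unique values

train_list = train.tolist()
keeps = set(keep.tolist())


def with_set():
    # Pure-Python version: set lookup inside a list comprehension
    return [k for k in train_list if k in keeps]


def with_isin():
    # numpy version: vectorized membership test plus boolean indexing
    return train[np.isin(train, keep)]


t_set = timeit.timeit(with_set, number=100)
t_isin = timeit.timeit(with_isin, number=100)
print(f"set comprehension: {t_set:.4f} s, np.isin: {t_isin:.4f} s")
```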