Keep all elements in one list from another


Problem description

I have two large lists, train and keep, with the latter containing unique elements, e.g.

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

Is there a way, using sets, to create a new list that has all the elements of train that are in keep? The end result should be:

train_keep = [1, 3, 4, 3, 1]

Currently I'm using itertools.filterfalse, from "how to keep elements of a list based on another list", but it is very slow because the lists are large.

Recommended answer

Convert the list keep into a set, since it will be checked for membership on every iteration. Iterate over train itself, since you want to keep order and repeats; that rules out turning train into a set. Even if a set were an option, it wouldn't help, since the iteration would have to happen anyway:

keeps = set(keep)
train_keep = [k for k in train if k in keeps]
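As a quick check, here is the set-based approach applied to the example lists from the question. The function name keep_elements is hypothetical, introduced only for illustration.

```python
def keep_elements(train, keep):
    """Return the elements of train that occur in keep, preserving order and repeats."""
    keeps = set(keep)  # average-case O(1) membership tests
    return [k for k in train if k in keeps]


train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

train_keep = keep_elements(train, keep)
print(train_keep)  # [1, 3, 4, 3, 1]
```

The set is built once, outside the comprehension, so each of the len(train) membership tests avoids a linear scan of keep.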

A lazier, and probably slower, version would be something like:

train_keep = filter(lambda x: x in keeps, train)
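To make "lazy" concrete: in Python 3, filter returns an iterator, so nothing is computed until the result is consumed. A small sketch using the question's example data (the variable names lazy, first and rest are only illustrative):

```python
train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keeps = {1, 3, 4}

lazy = filter(lambda x: x in keeps, train)  # no element has been tested yet

first = next(lazy)  # scans train only as far as the first match
rest = list(lazy)   # consumes the remaining matches

print(first, rest)  # 1 [3, 4, 3, 1]
```

If a real list is needed, wrap the call in list(...); at that point the work done is the same as in the list comprehension, plus the overhead of calling the lambda for each element.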

Neither of these options will give you a large speedup. You'd probably be better off using numpy, pandas, or some other library that implements the loops in C and stores numbers as something simpler than full-blown Python objects. Here is a sample numpy solution:

import numpy as np

train = np.array([...])
keep = np.array([...])
# np.isin returns a boolean mask with the same shape as train
train_keep = train[np.isin(train, keep)]
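With the example data from the question filled in, the np.isin approach looks like this (the intermediate name mask is just for illustration):

```python
import numpy as np

train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])

mask = np.isin(train, keep)  # True where the train element occurs in keep
train_keep = train[mask]     # boolean indexing preserves order and repeats

print(train_keep.tolist())  # [1, 3, 4, 3, 1]
```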

This is likely an O(M * N) algorithm, with M = train.size and N = keep.size, rather than the O(M) of the set lookup. But if checking the N elements of keep in C is faster than a nominally O(1) lookup in Python, you win.

You can get something closer to O(M log(N)) using a sorted lookup:

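train = np.array([...])
keep = np.array([...])
keep.sort()  # searchsorted requires a sorted array

# Index at which each train element would be inserted into keep
ind = np.searchsorted(keep, train, side='left')
# Elements larger than everything in keep get index keep.size; clamp them to the last valid index
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]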
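A runnable sketch of the sorted lookup on the question's data. The wrapper function filter_sorted is a hypothetical name, and it uses np.sort (which returns a copy) rather than sorting keep in place.

```python
import numpy as np


def filter_sorted(train, keep):
    """Keep the elements of train found in keep, via binary search on a sorted copy."""
    keep_sorted = np.sort(keep)
    # Leftmost insertion index of every train element in keep_sorted
    ind = np.searchsorted(keep_sorted, train, side='left')
    # Values above keep's maximum map to keep_sorted.size, which is out of bounds
    ind[ind == keep_sorted.size] -= 1
    # An element is present exactly when the value at its insertion index equals it
    return train[keep_sorted[ind] == train]


train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([4, 1, 3])  # deliberately unsorted

print(filter_sorted(train, keep).tolist())  # [1, 3, 4, 3, 1]
```

The clamping step is what handles the 5s here: they are larger than every element of keep, so their insertion index is 3, one past the end of the sorted array.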

A better alternative might be to append np.inf or a maximum out-of-bounds integer to the sorted keep array, so you don't need the extra step that distinguishes missing elements from edge elements at all. Something like max(train.max() + 1, keep.max()) would do (the built-in max, not np.max, whose second argument is an axis):

train = np.array([...])
keep = np.array([..., 99999])  # 99999 stands in for the out-of-bounds sentinel
keep.sort()

ind = np.searchsorted(keep, train, side='left')
train_keep = train[keep[ind] == train]
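A sketch of the sentinel variant with concrete data. The function name filter_with_sentinel is hypothetical, and the sketch assumes integer arrays for which train.max() + 1 does not overflow the dtype.

```python
import numpy as np


def filter_with_sentinel(train, keep):
    """Sorted lookup with a sentinel appended so no index clamping is needed."""
    # Strictly greater than every train element, so it can never match one
    sentinel = max(train.max() + 1, keep.max())
    keep_ext = np.sort(np.append(keep, sentinel))
    # Every train element is <= the last entry, so ind is always a valid index
    ind = np.searchsorted(keep_ext, train, side='left')
    return train[keep_ext[ind] == train]


train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])

print(filter_with_sentinel(train, keep).tolist())  # [1, 3, 4, 3, 1]
```

Here the sentinel is 6, so the 5s land on it and fail the equality test instead of producing an out-of-bounds index.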

For random inputs with train.size = 10000 and keep.size = 10, the numpy method is ~10x faster on my laptop.
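To reproduce a comparison of this kind on your own machine, a rough timing sketch follows. The value range 0 to 99, the seed, and the repeat count are assumptions, not the settings used above, and the measured ratio will vary with hardware and library versions.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
train = rng.integers(0, 100, size=10_000)    # assumed value range: 0..99
keep = rng.choice(100, size=10, replace=False)

# Pure-Python inputs for the set-based version
train_list = train.tolist()
keeps = set(keep.tolist())

# Both approaches should select the same elements in the same order
py_result = [k for k in train_list if k in keeps]
np_result = train[np.isin(train, keep)]
assert py_result == np_result.tolist()

t_py = timeit.timeit(lambda: [k for k in train_list if k in keeps], number=100)
t_np = timeit.timeit(lambda: train[np.isin(train, keep)], number=100)
print(f"set lookup: {t_py:.4f} s   np.isin: {t_np:.4f} s   (100 runs each)")
```

The conversion of train to a Python list is kept outside the timed section, so the set-based version is not charged for it.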
