Improve performance: Remove all strings in a (big) list appearing only once

Question description
I have a big list (25000 items, 14000 words) like this:
Before (see the edit below for the actual list format):
texts = ['Lorem hello ipsum', 'Lorem ipsum generator machine', ... 'hello Lorem ipsum']
I'd like to remove all words appearing only once in the whole list.
after:
texts = ['Lorem generator ipsum', 'Lorem ipsum generator machine', ..., 'Machine Lorem ipsum']
I'm doing this already now, but it's really slow (about 2 hours).
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
How can I improve the performance?
EDIT:
@DSM was right; my input list actually looks like this. My fault, sorry:
texts = [['Lorem', 'hello', 'ipsum'], ['Lorem', 'ipsum', 'generator', 'machine'], ... ['hello', 'Lorem', 'ipsum']]
You can use collections.Counter here to store the count of each word and then filter out words based on that count. With collections.Counter you can get the counts of all items in O(N) time, while your current approach (calling list.count for each word) takes O(N**2) time.
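As a quick illustration of that difference (a minimal sketch with made-up data, not part of the original answer): Counter tallies every word in a single pass, whereas each list.count call rescans the entire list.

```python
from collections import Counter

words = ['Lorem', 'hello', 'Lorem', 'ipsum', 'Lorem']

# One pass over the data: O(N)
c = Counter(words)

# Same result, but each .count() call rescans the whole list: O(N**2) overall
slow = {w: words.count(w) for w in set(words)}

assert dict(c) == slow
```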
And never use sum for flattening a list of lists; it is very slow (and in your actual code you have a list of strings, so sum() would raise a TypeError anyway). I've used a nested list comprehension in my answer; if you actually have a list of lists, then it's better to use itertools.chain.from_iterable here.
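To make the flattening point concrete (a small sketch with a hypothetical list): chain.from_iterable walks the sublists in one linear pass, while sum(lists, []) builds a fresh list on every addition, which is quadratic in the total length.

```python
from itertools import chain

lists = [['a', 'b'], ['c'], ['d', 'e']]

flat_slow = sum(lists, [])                    # quadratic: a new list is built at each step
flat_fast = list(chain.from_iterable(lists))  # linear: one lazy pass, no intermediate copies

assert flat_slow == flat_fast
```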
>>> from collections import Counter
>>> texts = ['Lorem hello ipsum', 'Lorem ipsum generator machine', 'hello Lorem ipsum']
>>> c = Counter(word for x in texts for word in x.split())
>>> [' '.join(y for y in x.split() if c[y] > 1) for x in texts]
['Lorem hello ipsum', 'Lorem ipsum', 'hello Lorem ipsum']
Timing comparison:
In [8]: texts = ['Lorem hello ipsum', 'Lorem ipsum generator machine', 'hello Lorem ipsum']
In [9]: huge_texts = [x.split()*100 for x in texts]*1000 #list of lists
In [10]: %%timeit
from collections import Counter
from itertools import chain
c = Counter(chain.from_iterable(huge_texts))
texts = [[word for word in x if c[word]>1] for x in huge_texts]
1 loops, best of 3: 791 ms per loop
In [11]: %%timeit
all_tokens = sum(huge_texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in huge_texts]
1 loops, best of 3: 20.4 s per loop
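Applied to the list-of-lists input from the edit, the same idea can keep your original tokens_once structure: build the set once from the Counter instead of calling all_tokens.count(word) per word (a sketch using the small sample from the question):

```python
from collections import Counter
from itertools import chain

texts = [['Lorem', 'hello', 'ipsum'],
         ['Lorem', 'ipsum', 'generator', 'machine'],
         ['hello', 'Lorem', 'ipsum']]

# Count every word in one O(N) pass
counts = Counter(chain.from_iterable(texts))

# Words appearing exactly once, kept as a set for O(1) membership tests
tokens_once = {word for word, n in counts.items() if n == 1}

texts = [[word for word in text if word not in tokens_once] for text in texts]
# -> [['Lorem', 'hello', 'ipsum'], ['Lorem', 'ipsum'], ['hello', 'Lorem', 'ipsum']]
```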