Improve performance: Remove all strings in a (big) list appearing only once


Problem description


I have a big list (25000 items, 14000 words) like this:

before (see the edit below for the actual list format):

texts = ['Lorem hello ipsum', 'Lorem ipsum generator machine', ... 'hello Lorem ipsum']

I'd like to remove all words appearing only once in the whole list.

after:

texts = ['Lorem generator ipsum', 'Lorem ipsum generator machine', ..., 'Machine Lorem ipsum']

I'm doing this already now, but it's really slow (about 2 hours).

all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]

How can I improve the performance?


EDIT:

@DSM was right, my input list actually looks like this. My fault, sorry:

texts = [['Lorem', 'hello', 'ipsum'], ['Lorem', 'ipsum', 'generator', 'machine'], ... ['hello', 'Lorem', 'ipsum']]

Solution

You can use collections.Counter here to store the count of each word and then filter out words based on that count. With collections.Counter you can get the counts of all items in O(N) time, while your current approach (list.count) takes O(N**2) time.
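The difference is that Counter builds every count in a single pass over the data, whereas list.count rescans the whole list on each call. A minimal sketch (with made-up sample words) illustrating both:

```python
from collections import Counter

words = ["lorem", "ipsum", "lorem", "hello"]

# One pass over the data builds all counts at once: O(N) total.
counts = Counter(words)
print(counts["lorem"])  # 2
print(counts["hello"])  # 1

# list.count walks the entire list per call, so counting every
# distinct word this way costs O(N**2) overall.
print(words.count("lorem"))  # 2
```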

And never use sum() for flattening a list of lists; it is very slow. (In the code you actually posted, texts is a list of strings, so sum(texts, []) would raise a TypeError.) I've used a nested list comprehension in my answer; if you really have a list of lists, then it's better to use itertools.chain.from_iterable here.

>>> from collections import Counter
>>> texts = ['Lorem hello ipsum', 'Lorem ipsum generator machine', 'hello Lorem ipsum']
>>> c = Counter(word for x in texts for word in x.split())
>>> [' '.join(y for y in x.split() if c[y] > 1) for x in texts]
['Lorem hello ipsum', 'Lorem ipsum', 'hello Lorem ipsum']
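The demo above handles a list of strings; for the list-of-lists input from the edit, the same idea with chain.from_iterable looks like this (a sketch using the small example lists from the question):

```python
from collections import Counter
from itertools import chain

texts = [['Lorem', 'hello', 'ipsum'],
         ['Lorem', 'ipsum', 'generator', 'machine'],
         ['hello', 'Lorem', 'ipsum']]

# Flatten lazily in O(N) instead of sum(texts, []), which copies repeatedly.
c = Counter(chain.from_iterable(texts))

# Keep only words that occur more than once in the whole corpus.
texts = [[word for word in text if c[word] > 1] for text in texts]
print(texts)
# [['Lorem', 'hello', 'ipsum'], ['Lorem', 'ipsum'], ['hello', 'Lorem', 'ipsum']]
```

Note that 'generator' and 'machine' each appear only once in the corpus, so they are dropped.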

Timing comparison:

In [8]: texts = ['Lorem hello ipsum', 'Lorem ipsum generator machine', 'hello Lorem ipsum']

In [9]: huge_texts = [x.split()*100 for x in texts]*1000  #list of lists                            

In [10]: %%timeit
from collections import Counter
from itertools import chain
c = Counter(chain.from_iterable(huge_texts))
texts = [[word for word in x if c[word]>1] for x in huge_texts]

1 loops, best of 3: 791 ms per loop

In [11]: %%timeit 
all_tokens = sum(huge_texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in huge_texts]                                                        

1 loops, best of 3: 20.4 s per loop 
