Python:如何有效检查项目是否在列表中? [英] Python: how to check that if an item is in a list efficiently?

查看:237
本文介绍了Python:如何有效检查项目是否在列表中?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个字符串列表(如单词),并且在解析文本时,我需要检查一个单词是否属于当前列表中的单词组.

I have a list of strings (words like), and, while I am parsing a text, I need to check if a word belongs to the group of words of my current list.

但是,我的输入量很大(大约6亿行),根据Python文档,检查元素是否属于列表是O(n)操作.

However, my input is pretty big (about 600 millions lines), and checking if an element belongs to a list is a O(n) operation according to the Python documentation.

我的代码如下:

words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)

由于这花费了太多时间(实际上是几天),所以我想改进大部分时间.我看一下Python集合,更确切地说,是双端队列.但是,只有O(1)操作时间可以访问列表的开头和结尾,而不能访问中间.

As it takes too much time (days actually), I wanted to improve that part which is taking most of the time. I have a look at Python collections, and, more precisely, at deque. However, the only give a O(1) operation time access to the head and the tail of a list, not in the middle.

有人对如何更好地做到这一点有想法吗?

Do someone has an idea about how to do that in a better way?

推荐答案

您可以考虑使用 trie DAWG 或数据库.有几种相同的Python实现.

You might consider a trie or a DAWG or a database. There are several Python implementations of the same.

以下是一些相对的时间安排供您考虑集合与列表:

Here is some relative timings for you to consider of a set vs a list:

import timeit
import random

with open('/usr/share/dict/words','r') as di:  # UNIX 250k unique word list 
    all_words_set={line.strip() for line in di}

all_words_list=list(all_words_set)    # slightly faster if this list is sorted...      

test_list=[random.choice(all_words_list) for i in range(10000)] 
test_set=set(test_list)

def set_f():
    count = 0
    for word in test_set:
        if word in all_words_set: 
           count+=1
    return count

def list_f():
    count = 0
    for word in test_list:
        if word in all_words_list: 
           count+=1
    return count

def mix_f():
    # use list for source, set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set: 
           count+=1
    return count    

print "list:", timeit.Timer(list_f).timeit(1),"secs"
print "set:", timeit.Timer(set_f).timeit(1),"secs" 
print "mixed:", timeit.Timer(mix_f).timeit(1),"secs" 

打印:

list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs

即,将10000个单词的集合与250,000个单词的集合进行匹配比在相同的250,000个单词的列表中对相同的10000个单词进行匹配的速度 17,085 X .与单独使用未排序列表相比,使用源列表和成员资格测试集要快 28,392 X .

ie, matching a set of 10000 words against a set of 250,000 words is 17,085 X faster than matching a list of same 10000 words in a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392 X faster than an unsorted list alone.

对于成员资格测试,用于查找的列表为O(n),而集和字典为O(1).

For membership testing, a list is O(n) and sets and dicts are O(1) for lookups.

结论:对6亿行文本使用更好的数据结构!

Conclusion: Use better data structures for 600 million lines of text!

这篇关于Python:如何有效检查项目是否在列表中?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆