Python Multiprocessing list-dictionary comparison


Question

I have a list contains 700,000 items and a dictionary contains 300,000 keys. Some of the 300k keys are contained within the 700k items stored in the list. Now, I have built a simple comparison and handling loop:

import datetime

# lines contains about 700k rows - id,firstname,lastname,email,lastupdate
lines = open(r'myfile.csv', 'rb').readlines()
dictionary = {}
# dictionary contains 300k ID keys
dictionary[someID] = {'first': 'john',
                      'last': 'smith',
                      'email': 'john.smith@gmail.com',
                      'lastupdate': datetime_object}
for line in lines:
    id, firstname, lastname, email, lastupdate = line.split(',')
    lastupdate = datetime.datetime.strptime(lastupdate, '%Y-%m-%d %H:%M:%S')
    if id in dictionary.keys():
        if lastupdate > dictionary[id]['lastupdate']:
            pass  # update values in dictionary[id]
    else:
        pass  # create new id inside dictionary and fill with keys:values

I wish to speed things up a little and use multiprocessing for this kind of job. For this, I thought I could split the list into four smaller lists, Pool.map each list, and check them separately with each of the four processes I'll create, producing four new dictionaries. The problem is that in order to create one whole dictionary with the last-updated values, I would have to repeat the process with the four newly created dictionaries, and so on.
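For reference, the merge step described above (combining the per-process dictionaries while keeping the newest `lastupdate` per id) could be sketched like this; the helper name and sample data are hypothetical:

```python
import datetime
from functools import reduce

def merge_newest(a, b):
    # keep whichever copy of each id has the later lastupdate
    for k, v in b.items():
        if k not in a or v['lastupdate'] > a[k]['lastupdate']:
            a[k] = v
    return a

old = datetime.datetime(2015, 1, 1)
new = datetime.datetime(2016, 1, 1)
# stand-ins for the dictionaries the four worker processes would return
parts = [{'a': {'lastupdate': old}},
         {'a': {'lastupdate': new}},
         {'b': {'lastupdate': old}}]
merged = reduce(merge_newest, parts, {})
print(merged['a']['lastupdate'].year)  # the newer 'a' record wins
```

This merge is itself a linear pass over every key, which is part of why splitting the work may not pay off here.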

Has anyone encountered such a problem and found a solution or an idea for it?

Thanks

Solution

if id in dictionary.keys():

NO! Please No! This is an O(n) operation!!! The right way to do it is simply

if id in dictionary

which takes O(1) time!!!

Before thinking about using multiprocessing etc. you should avoid this really inefficient operation. If the dictionary has 300k keys, that line was probably the bottleneck.
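The difference is easy to measure. A small sketch (the 300k-key dictionary is synthetic, and `list(d)` is used to simulate Python 2's list-building `.keys()`):

```python
import timeit

# synthetic dictionary with 300k keys, like the one in the question
d = {str(i): None for i in range(300000)}
key = str(299999)

# O(1): hash lookup directly in the dict
t_dict = timeit.timeit(lambda: key in d, number=1000)

# Python 2's d.keys() built a list of keys; membership there is an O(n) scan
keys_list = list(d)
t_list = timeit.timeit(lambda: key in keys_list, number=1000)

print('dict: %.6fs  list: %.6fs' % (t_dict, t_list))
```

The list scan is slower by orders of magnitude, and the gap grows with the number of keys.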


I have assumed python2; if this is not the case then you should have used the python-3.x tag. In python3, key in dictionary.keys() is O(1) because .keys() now returns a view of the dict instead of a list of keys; however, it is still slightly faster to omit .keys().
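Applying the fix to the loop from the question, a minimal sketch (field names taken from the question; the rows here are sample data):

```python
import datetime

def merge_rows(dictionary, rows):
    # rows: iterable of "id,firstname,lastname,email,lastupdate" strings
    for line in rows:
        row_id, first, last, email, lastupdate = line.strip().split(',')
        lastupdate = datetime.datetime.strptime(lastupdate, '%Y-%m-%d %H:%M:%S')
        entry = dictionary.get(row_id)  # single O(1) lookup, no .keys()
        if entry is None or lastupdate > entry['lastupdate']:
            dictionary[row_id] = {'first': first, 'last': last,
                                  'email': email, 'lastupdate': lastupdate}
    return dictionary

d = merge_rows({}, ["1,john,smith,john.smith@gmail.com,2015-01-01 10:00:00",
                    "1,john,smith,john.smith@gmail.com,2016-01-01 10:00:00"])
print(d['1']['lastupdate'])  # the newer record is kept
```

With `dict.get` there is one hash lookup per row instead of a scan over 300k keys, so the 700k-row pass should run in seconds in a single process.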
