Efficiently identify duplicates in large list (500,000+)


Problem description

I have a large list of DOIs and I need the most efficient way to identify the DOIs which are repeated (i.e., print out the index and the DOI for values which are repeated). The array of DOIs could consist of 500,000+ DOIs. My current approach is this (inspired by this answer):

from collections import defaultdict

# Map each DOI to the list of indexes at which it appears.
D = defaultdict(list)
for i, item in enumerate(doiList):
    D[item].append(i)

# Keep only the DOIs that occur more than once.
D = {k: v for k, v in D.items() if len(v) > 1}
print(D)

Is there a more processing-efficient way of doing this?

Sample DOI list:

doiList = ['10.1016/j.ijnurstu.2017.05.011 [doi]','10.1016/j.ijnurstu.2017.05.011 [doi]' ,'10.1167/iovs.16-20421 [doi]', '10.1093/cid/cix478 [doi]', '10.1038/bjc.2017.133 [doi]', '10.3892/or.2017.5646 [doi]', '10.1177/0961203317711009 [doi]', '10.2217/bmm-2017-0087 [doi]', '10.1007/s12016-017-8611-x [doi]', '10.1007/s10753-017-0594-5 [doi]', '10.1186/s13601-017-0150-2 [doi]', '10.3389/fimmu.2017.00515 [doi]', '10.2147/JAA.S131506 [doi]', '10.2147/JAA.S128431 [doi]', '10.1038/s41598-017-02293-z [doi]', '10.18632/oncotarget.17729 [doi]', '10.1073/pnas.1703683114 [doi]', '10.1096/fj.201600857RRR [doi]', '10.1128/AAC.00020-17 [doi]', '10.1016/j.jpain.2017.04.011 [doi]', '10.1016/j.jaip.2017.04.029 [doi]', '10.1016/j.anai.2017.04.021 [doi]', '10.1016/j.alit.2017.05.001 [doi]']
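
For reference, only the first two entries of this sample list share a value, so the defaultdict snippet above applied to it should print a single entry, mapping the duplicated DOI to both indexes where it occurs:

{'10.1016/j.ijnurstu.2017.05.011 [doi]': [0, 1]}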

Recommended answer

Try storing them in a set instead. You can append the duplicates to a single list, which might speed things up:

seen = set()   # every distinct DOI encountered so far
dupes = []     # indexes of the 2nd, 3rd, ... occurrences

for i, doi in enumerate(doiList):
    if doi not in seen:
        seen.add(doi)    # first sighting: remember it
    else:
        dupes.append(i)  # repeat sighting: record its index

At this point, seen contains all the distinct DOI values, while dupes contains the indexes of the 2nd, 3rd, etc. occurrences of duplicated values. You can look them up in doiList to determine which index corresponds to which value.
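
As a minimal sketch, that lookup gives exactly the output the question asks for, the index and the DOI of each repeated value:

for i in dupes:
    print(i, doiList[i])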

To get some more performance out of this, you can cache the bound methods (seen.add and dupes.append) in local variables, which avoids repeating the attribute lookup on every loop iteration:

seen = set()
seen_add = seen.add          # bind the bound method to a local name once...
dupes = []
dupes_append = dupes.append  # ...so the loop skips the attribute lookup

for i, doi in enumerate(doiList):
    if doi not in seen:
        seen_add(doi)
    else:
        dupes_append(i)
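
If you want to verify the gain on your own data, a rough timeit comparison might look like the sketch below; the synthetic doiList here is a hypothetical stand-in for a real list of 500,000 DOIs, not data from the question:

import timeit

# Hypothetical stand-in data: 500,000 entries, 100,000 of them duplicates.
doiList = ['10.%d/example [doi]' % (i % 400000) for i in range(500000)]

def find_dupes_plain(items):
    seen = set()
    dupes = []
    for i, doi in enumerate(items):
        if doi not in seen:
            seen.add(doi)
        else:
            dupes.append(i)
    return dupes

def find_dupes_cached(items):
    seen = set()
    seen_add = seen.add          # local bindings, as in the variant above
    dupes = []
    dupes_append = dupes.append
    for i, doi in enumerate(items):
        if doi not in seen:
            seen_add(doi)
        else:
            dupes_append(i)
    return dupes

print(timeit.timeit(lambda: find_dupes_plain(doiList), number=10))
print(timeit.timeit(lambda: find_dupes_cached(doiList), number=10))

In practice any advantage from the cached variant tends to be modest, since hashing the strings dominates the loop, so it is worth measuring before trading readability for it.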
