Improving performance of very large dictionary in Python


Problem description

I find that if I initialize an empty dictionary at the beginning and then add elements to it in a for loop (about 110,000 keys; the value for each key is a list that also grows inside the loop), the speed goes down as the loop progresses.

I suspect that the problem is that the dictionary does not know the number of keys at init time and is not doing anything very smart about it, so perhaps storage collisions become quite frequent and slow it down.

If I know the number of keys and exactly what those keys are, is there any way in Python to make a dict (or a hash table) work more efficiently? I vaguely remember that if you know the keys, you can design the hash function smartly (a perfect hash?) and allocate the space beforehand.
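
For reference, a minimal sketch of the pattern described above (the keys and values here are synthetic stand-ins; the real data is not shown in the question):

import random

# Synthetic stand-in for the real data: ~110,000 distinct keys,
# with several values accumulating per key.
data = [(random.randrange(110_000), i) for i in range(500_000)]

d = {}
for key, value in data:
    if key not in d:
        d[key] = []          # the dict grows key by key inside the loop
    d[key].append(value)     # the per-key list grows as well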

Solution

If I know the number of keys and exactly what those keys are, is there any way in Python to make a dict (or a hash table) work more efficiently? I vaguely remember that if you know the keys, you can design the hash function smartly (a perfect hash?) and allocate the space beforehand.

Python doesn't expose a pre-sizing option to speed up the "growth phase" of a dictionary, nor does it provide any direct controls over "placement" in the dictionary.

That said, if the keys are always known in advance, you can store them in a set and build your dictionaries from the set using dict.fromkeys(). That classmethod is optimized to pre-size the dictionary based on the set size and it can populate the dictionary without any new calls to __hash__():

>>> keys = {'red', 'green', 'blue', 'yellow', 'orange', 'pink', 'black'}
>>> d = dict.fromkeys(keys)  # dict is pre-sized to 32 empty slots

If reducing collisions is your goal, you can run experiments on the insertion order in the dictionary to minimize pile-ups. (Take a look at Brent's variation on Algorithm D in Knuth's TAOCP to get an idea of how this is done).

By instrumenting a pure Python model for dictionaries (such as this one), it is possible to count the weighted-average number of probes for an alternative insertion order. For example, inserting dict.fromkeys([11100, 22200, 44400, 33300]) averages 1.75 probes per lookup. That beats the 2.25 average probes per lookup for dict.fromkeys([33300, 22200, 11100, 44400]).
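
As a rough illustration (not the model referenced above), the sketch below counts probes under a simplified CPython-style probe sequence; the exact counts depend on the table size and other implementation details, so they will not necessarily reproduce the 1.75 / 2.25 figures quoted here:

PERTURB_SHIFT = 5            # same constant CPython uses

def probes_for_order(keys, size=8):
    """Insert distinct *keys* into an open-addressed table with *size*
    slots (a power of two) and return the average probes per key."""
    mask = size - 1
    table = [None] * size
    total = 0
    for key in keys:
        perturb = h = hash(key)
        i = h & mask
        probes = 1
        while table[i] is not None:              # collision: keep probing
            perturb >>= PERTURB_SHIFT
            i = (i * 5 + perturb + 1) & mask
            probes += 1
        table[i] = key
        total += probes
    return total / len(keys)

for order in ([11100, 22200, 44400, 33300],
              [33300, 22200, 11100, 44400]):
    print(order, probes_for_order(order))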

Another "trick" is to increase spareness in a fully populated dictionary by fooling it into increasing its size without adding new keys:

 d = dict.fromkeys(['red', 'green', 'blue', 'yellow', 'orange'])
 d.update(dict(d))     # This makes room for additional keys
                       # and makes the set collision-free.
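
Whether and by how much the table actually grows is a CPython implementation detail that has varied between versions; one way to observe the effect (on interpreters where it applies) is to compare sys.getsizeof() before and after:

import sys

d = dict.fromkeys(['red', 'green', 'blue', 'yellow', 'orange'])
before = sys.getsizeof(d)
d.update(dict(d))            # same keys, but may trigger an internal resize
after = sys.getsizeof(d)
print(before, after)         # on many CPython versions, `after` is larger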

Lastly, you can introduce your own custom __hash__() for your keys with the goal of eliminating all collisions (perhaps using a perfect hash generator such as gperf).
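
As a hypothetical sketch of that idea, the class below stands in for a generated perfect hash by assigning each key a precomputed, collision-free hash value (gperf itself emits C, so a pure-Python version has to precompute the values some other way); note that every lookup must then use the wrapper type, because a plain str hashes differently:

class PerfectKey(str):
    """str subclass whose hash is a precomputed, collision-free value."""
    def __new__(cls, value, slot):
        self = super().__new__(cls, value)
        self._slot = slot            # precomputed "perfect hash" value
        return self

    def __hash__(self):
        return self._slot

names = ['red', 'green', 'blue', 'yellow', 'orange', 'pink', 'black']
keys = [PerfectKey(name, slot) for slot, name in enumerate(names)]
d = dict.fromkeys(keys)
print(keys[2] in d)              # True -- looked up via the custom hash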
