Python's underlying hash data structure for dictionaries

Problem Description

I am building a very large dictionary and I am performing many checks to see if a key is in the structure, and then adding it if it is unique or incrementing a counter if it is identical.
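For reference, a minimal sketch of that pattern, with `items` as a hypothetical stand-in for the input keys; `collections.Counter` does the same job in one step:

```python
from collections import Counter

items = ["a", "b", "a", "c", "b", "a"]  # hypothetical input keys

# The check-then-update pattern described above:
counts = {}
for key in items:
    if key in counts:
        counts[key] += 1  # already present: increment its counter
    else:
        counts[key] = 1   # unique so far: add it

# The same result, idiomatically:
counts = Counter(items)
print(counts)  # Counter({'a': 3, 'b': 2, 'c': 1})
```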

Python uses a hash data structure to store dictionaries (not to be confused with a cryptographic hash function). Lookups are O(1), but if the hash table is full, it has to be rehashed, which is very expensive.
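As a toy illustration of the idea (not CPython's actual open-addressing and probing scheme), a key's hash selects a slot in a fixed-size table, which is why a lookup does not depend on the number of entries:

```python
# Toy illustration only: CPython's real probing logic is more involved,
# and string hashes are randomized per process, so output will vary.
table_size = 8
for key in ("spam", "eggs", 42):
    print(f"{key!r} -> slot {hash(key) % table_size}")
```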

My question is, would I be better off using an AVL binary search tree, or is a hash table good enough?

Recommended Answer

The only way to be sure would be to implement both and check, but my informed guess is that the dictionary will be faster, because a binary search tree has cost O(log(n)) for lookup and insertion, and I think that except under the most pessimal of situations (such as massive hash collisions) the hash table's O(1) lookup will outweigh the occasional resize.
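A rough harness for the "implement both and check" side, assuming a counting workload like the one in the question; the AVL side would need a third-party tree implementation, since the standard library doesn't ship one:

```python
import random
import timeit

N = 1_000_000
keys = [random.randrange(N // 2) for _ in range(N)]  # ~50% duplicate keys

def count_with_dict():
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1  # lookup + insert/increment
    return counts

# Average seconds per full counting pass over N keys.
print(timeit.timeit(count_with_dict, number=3) / 3)
```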

If you take a look at the Python dictionary implementation, you'll see that:

  1. a dictionary starts out with room for 8 entries (PyDict_MINSIZE);
  2. a dictionary with 50,000 or fewer entries quadruples in size when it grows;
  3. a dictionary with more than 50,000 entries doubles in size when it grows;
  4. key hashes are cached in the dictionary, so they are not recomputed when the dictionary is resized.

(The "NOTES ON OPTIMIZING DICTIONARIES" are worth reading too.)

So if your dictionary has 1,000,000 entries, I believe that it will be resized eleven times (8 → 32 → 128 → 512 → 2048 → 8192 → 32768 → 131072 → 262144 → 524288 → 1048576 → 2097152) at a cost of 2,009,768 extra insertions during the resizes. This seems likely to be much less than the cost of all the rebalancing involved in 1,000,000 insertions into an AVL tree.
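The arithmetic can be checked with a few lines that replay the growth rule quoted above (quadruple at 50,000 entries or fewer, double after that) and sum the entries re-inserted at each step, under the answer's simplifying assumption that a full table's worth of entries is copied per resize:

```python
N = 1_000_000
size, chain = 8, [8]
while size < 2 * N:  # grow until the table comfortably holds N keys
    size *= 4 if size <= 50_000 else 2
    chain.append(size)

print(" -> ".join(map(str, chain)))  # 8 -> 32 -> ... -> 2097152
print(sum(chain[:-1]))               # 2009768 re-inserted entries
```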
