Python dict如何存储密钥,发生冲突时的值? [英] How Python dict stores key, value when collision occurs?

查看:162
本文介绍了Python dict如何存储密钥,发生冲突时的值?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Python如何存储dict键,在hash表中发生冲突时的值?

解决方案

对于普通Python,这个伟大的写作 Praveen Gollakota非常好地解释,这里是重要的位:




  • Python字典被实现为哈希表。哈希表由插槽组成,密钥通过哈希函数映射到插槽。

  • 哈希表实现必须允许哈希冲突,即使两个密钥具有相同的哈希值,执行表必须有一个策略来明确地插入和检索密钥和值对。

  • Python dict使用开放寻址来解决哈希冲突(参见 dictobject.c :296-297)。

  • 在开放寻址中,哈希冲突通过探测来解决(下面解释)。

  • 哈希表只是一个连续的内存块(像一个数组,所以你可以做 O(1 )按索引查找)。

  • 哈希表中的每个插槽都可以存储一个且只有一个条目。这很重要。

  • 表中的每个条目实际上是三个项目的组合 - < hash,key,value> 。这被实现为C结构体(请参阅 dictobject.h : 51-56)。

  • 当一个新的dict被初始化时,它以8个插槽开始。 (请参阅 dictobject.h :49)

  • 在表中添加条目时,我们从基于密钥的哈希的一些插槽开始, i 。 CPython使用初始 i = hash(key)&面具,其中 mask = PyDictMINSIZE - 1 ,但这并不重要。请注意,检查的初始插槽 i 取决于密钥的哈希值。

  • 如果该插槽为空,该条目被添加到插槽(通过条目,我的意思是< hash | key | value> )。但是如果这个插槽被占用呢?最可能是因为另一个条目具有相同的哈希(哈希冲突!)

  • 如果插槽被占用,CPython(甚至PyPy)比较哈希和密钥(通过比较我的意思==相对于插入的当前条目的关键字( 比较) /Objects/dictobject.crel =nofollow> dictobject.c 337,344-345)。如果两者匹配,则认为条目已经存在,放弃并移动到要插入的下一个条目。如果哈希或密钥不匹配,它将开始探测。

  • 探测只是意味着它会按插槽搜索空位以找到空的插槽。从技术上讲,我们可以一个接一个去, i + 1 i + 2 ,...并使用第一个可用的(即线性探测)。但是由于在评论中精美地解释了原因(请参阅 dictobject.c :33-126),CPython使用随机探测。在随机探测中,以伪随机顺序挑选下一个时隙。该条目将添加到第一个空插槽。对于这个讨论,用于选择下一个插槽的实际算法并不重要(参见 dictobject.c :33-126用于探测算法)。重要的是,插槽被探测,直到找到第一个空插槽。

  • 同样的事情发生在查找,只是从初始插槽开始 i (其中 i 取决于密钥的哈希)。如果哈希和密钥两者都不匹配插槽中的条目,它将开始探测,直到找到匹配的插槽。如果所有插槽都已用尽,则会报告失败。

  • 为了避免查找速度减慢,当三分之二满时,dict将被调整大小(请参阅 dictobject.h :64-65)。


How Python stores the dict key, values when collision occurs in hash table? Whats the hash algorithm used to get the hash value here?

解决方案

For the "normal" Python, this great writeup by Praveen Gollakota explains it very well, here are the important bits:

  • Python dictionaries are implemented as hash tables. Hash tables consist of slots, and keys are mapped to the slots via a hashing function.
  • Hash table implementations must allow for hash collisions i.e. even if two keys have same hash value, the implementation of the table must have a strategy to insert and retrieve the key and value pairs unambiguously.
  • Python dict uses open addressing to resolve hash collisions (see dictobject.c:296-297).
  • In open addressing, hash collisions are resolved by probing (explained below) .
  • The hash table is just a contiguous block of memory (like an array, so you can do O(1) lookup by index).
  • Each slot in the hash table can store one and only one entry. This is important.
  • Each entry in the table actually a combination of the three items - <hash, key, value>. This is implemented as a C struct (see dictobject.h:51-56).
  • When a new dict is initialized, it starts with 8 slots. (see dictobject.h:49)
  • When adding entries to the table, we start with some slot, i that is based on the hash of the key. CPython uses initial i = hash(key) & mask, where mask = PyDictMINSIZE - 1, but that's not really important. Just note that the initial slot, i, that is checked depends on the hash of the key.
  • If that slot is empty, the entry is added to the slot (by entry, I mean, <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)
  • If the slot is occupied, CPython (and even PyPy) compares the hash and the key (by compare I mean == comparison not the is comparison) of the entry in the slot against the key of the current entry to be inserted (dictobject.c337,344-345). If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don't match, it starts probing.
  • Probing just means it searches the slots by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the algorithm for probing). What is important is that the slots are probed until first empty slot is found.
  • The same thing happens for lookups, just starts with the initial slot i (where i depends on the hash of the key). If the hash and the key both don't match the entry in the slot, it starts probing, until it finds a slot with a match. If all slots are exhausted, it reports a fail.
  • To avoid slowing down lookups, the dict will be resized when it is two-thirds full (see dictobject.h:64-65).

这篇关于Python dict如何存储密钥,发生冲突时的值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆