Python的内置字典如何实现 [英] How are Python's Built In Dictionaries Implemented

查看:145
本文介绍了Python的内置字典如何实现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有没有人知道python的内置字典类型是如何实现的?我的理解是,它是某种哈希表,但是我无法找到任何确定的答案。

解决方案

p>这里是关于Python dict的一切,我可以把它放在一起(可能比任何人都想知道的更多,但答案是全面的)。




  • Python字典实现为哈希表

  • 哈希表必须允许哈希冲突,即即使两个不同的键具有相同的哈希值,表的实现也必须有一个策略来明确地插入和检索密钥和值对。

  • Python dict 使用开放式寻址来解决哈希冲突(如下所述)(请参阅 dictobject.c:296-297 )。

  • Python散列表只是一个连续的内存块(类似一个数组,所以你可以通过索引来执行一个 O(1)

  • 表格中的每个插槽都可以存储一个条目。这很重要。

  • 每个实际上是三个值的组合:<哈希,键,值> 。这被实现为C结构体(请参阅 dictobject.h:51- 56 )。

  • 下图是Python散列表的逻辑表示。在下图中,左侧的 0,1,...,i,... 是散列表中插槽的索引(它们仅仅是为了说明的目的,并没有与表格一起存储!)。

     #Python哈希表的逻辑模型
    - + ----------------- +
    0 | < hash | key | value> |
    - + ----------------- +
    1 | ... |
    - + ----------------- +
    ... |
    - + ----------------- +
    i | ... |
    - + ----------------- +
    ... |
    - + ----------------- +
    n | ... |
    - + ----------------- +


  • 当一个新的dict被初始化时,它以8个开始。 (请参阅 dictobject.h:49


  • 在表中添加条目时,我们从基于哈希的密钥开始一些插槽 i 。 CPython最初使用 i = hash(key)&面具(其中 mask = PyDictMINSIZE - 1 ,但这并不重要)。请注意,检查的初始插槽 i 取决于密钥的哈希

  • 如果该插槽为空,则将该条目添加到插槽中(按条目,我的意思是< hash | key | value> )。但是如果这个插槽被占用呢?很可能是因为另一个条目具有相同的哈希(哈希冲突!)

  • 如果插槽被占用,CPython(甚至PyPy)将哈希和密钥 (通过比较我的意思是比较不符合 == 比较而不是比较)的当前条目( dictobject.c:337,344-345) )。如果都匹配,那么它认为条目已经存在,放弃并移动到要插入的下一个条目。如果哈希或密钥不匹配,它将开始探测

  • 探测只是意味着它会按插槽搜索,找到一个空的插槽。从技术上讲,我们可以一个接一个地 i + 1,i + 2,... 使用第一个可用的(即线性探测)。但是,由于在评论中精心解释的原因(请参阅 dictobject.c:33 -126 ),CPython使用随机探测。在随机探测中,以伪随机顺序挑选下一个时隙。该条目将添加到第一个空插槽。对于这个讨论,用于选择下一个插槽的实际算法并不重要(参见 dictobject.c:33-126 用于探测的算法)。重要的是,插槽被探测,直到找到第一个空插槽。

  • 同样的事情发生在查找,只是从初始插槽i开始(我依赖于键)。如果哈希和密钥两者都不匹配插槽中的条目,它将开始探测,直到找到匹配的插槽。如果所有插槽都用完,它会报告失败。

  • BTW,如果三分之二满,则 dict 将被调整大小。这样可以避免查找速度减慢。 (请参阅 dictobject.h:64-65



注意:我做了关于Python Dict实现的研究,以回应我自己的问题。我在这里发布了一个稍微编辑的回复版本,因为所有的研究都与这个问题非常相关。


Does anyone know how the built in dictionary type for python is implemented? My understanding is that it is some sort of hash table, but I haven't been able to find any sort of definitive answer.

解决方案

Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive).

  • Python dictionaries are implemented as hash tables.
  • Hash tables must allow for hash collisions i.e. even if two distinct keys have the same hash value, the table's implementation must have a strategy to insert and retrieve the key and value pairs unambiguously.
  • Python dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).
  • Python hash table is just a contiguous block of memory (sort of like an array, so you can do an O(1) lookup by index).
  • Each slot in the table can store one and only one entry. This is important.
  • Each entry in the table actually a combination of the three values: < hash, key, value >. This is implemented as a C struct (see dictobject.h:51-56).
  • The figure below is a logical representation of a Python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).

    # Logical model of Python Hash table
    -+-----------------+
    0| <hash|key|value>|
    -+-----------------+
    1|      ...        |
    -+-----------------+
    .|      ...        |
    -+-----------------+
    i|      ...        |
    -+-----------------+
    .|      ...        |
    -+-----------------+
    n|      ...        |
    -+-----------------+
    

  • When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)

  • When adding entries to the table, we start with some slot, i, that is based on the hash of the key. CPython initially uses i = hash(key) & mask (where mask = PyDictMINSIZE - 1, but that's not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.
  • If that slot is empty, the entry is added to the slot (by entry, I mean, <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)
  • If the slot is occupied, CPython (and even PyPy) compares the hash AND the key (by compare I mean == comparison not the is comparison) of the entry in the slot against the hash and key of the current entry to be inserted (dictobject.c:337,344-345) respectively. If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don't match, it starts probing.
  • Probing just means it searches the slots by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the algorithm for probing). What is important is that the slots are probed until first empty slot is found.
  • The same thing happens for lookups, just starts with the initial slot i (where i depends on the hash of the key). If the hash and the key both don't match the entry in the slot, it starts probing, until it finds a slot with a match. If all slots are exhausted, it reports a fail.
  • BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups. (see dictobject.h:64-65)

NOTE: I did the research on Python Dict implementation in response to my own question about how multiple entries in a dict can have same hash values. I posted a slightly edited version of the response here because all the research is very relevant for this question as well.

这篇关于Python的内置字典如何实现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆