Python 的内置字典是如何实现的? [英] How are Python's Built In Dictionaries Implemented?

查看:29
本文介绍了Python 的内置字典是如何实现的?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有谁知道python的内置字典类型是如何实现的?我的理解是它是某种哈希表,但我找不到任何明确的答案.

Does anyone know how the built in dictionary type for python is implemented? My understanding is that it is some sort of hash table, but I haven't been able to find any sort of definitive answer.

推荐答案

这里是我能够整理的有关 Python dict 的所有内容(可能比任何人都想知道的要多;但答案是全面的).

Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive).

  • Python 字典实现为哈希表.

哈希表必须允许哈希冲突,即,即使两个不同的键具有相同的哈希值,该表的实现也必须具有明确插入和检索键值对的策略.

Hash tables must allow for hash collisions i.e. even if two distinct keys have the same hash value, the table's implementation must have a strategy to insert and retrieve the key and value pairs unambiguously.

Python dict 使用 开放寻址 来解决哈希冲突(如下所述)(参见 dictobject.c:296-297).

Python dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).

Python 哈希表只是一个连续的内存块(有点像数组,因此您可以按索引进行 O(1) 查找).

Python hash table is just a contiguous block of memory (sort of like an array, so you can do an O(1) lookup by index).

表中的每个槽只能存储一个条目.这很重要.

表中的每个条目实际上是三个值的组合:<散列、键、值 >.这是作为 C 结构实现的(参见 dictobject.h:51-56).

Each entry in the table is actually a combination of the three values: < hash, key, value >. This is implemented as a C struct (see dictobject.h:51-56).

下图是一个 Python 哈希表的逻辑表示.下图中,左边的0, 1, ..., i, ... 是哈希表中槽位的索引(它们只是为了说明目的,显然不与表一起存储!).

The figure below is a logical representation of a Python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).

  # Logical model of Python Hash table
  -+-----------------+
  0| <hash|key|value>|
  -+-----------------+
  1|      ...        |
  -+-----------------+
  .|      ...        |
  -+-----------------+
  i|      ...        |
  -+-----------------+
  .|      ...        |
  -+-----------------+
  n|      ...        |
  -+-----------------+

  • 当一个新的 dict 被初始化时,它以 8 个 slots 开头.(参见 dictobject.h:49)

    当向表中添加条目时,我们从一些槽开始,i,它基于键的散列.CPython 最初使用 i = hash(key) &mask(其中 mask = PyDictMINSIZE - 1,但这并不重要).请注意,检查的初始插槽 i 取决于密钥的 hash.

    When adding entries to the table, we start with some slot, i, that is based on the hash of the key. CPython initially uses i = hash(key) & mask (where mask = PyDictMINSIZE - 1, but that's not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.

    如果该槽为空,则将条目添加到该槽中(通过条目,我的意思是 ).但是如果那个插槽被占用了怎么办!?很可能是因为另一个条目具有相同的哈希值(哈希冲突!)

    If that slot is empty, the entry is added to the slot (by entry, I mean, <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)

    如果插槽被占用,CPython(甚至 PyPy)比较哈希和键(比较我的意思是 == 比较而不是 is 比较)插槽中的条目与要插入的当前条目的哈希和键(dictobject.c:337,344-345) 分别.如果 both 匹配,那么它认为该条目已经存在,放弃并移动到下一个要插入的条目.如果哈希或密钥不匹配,则开始探测.

    If the slot is occupied, CPython (and even PyPy) compares the hash AND the key (by compare I mean == comparison not the is comparison) of the entry in the slot against the hash and key of the current entry to be inserted (dictobject.c:337,344-345) respectively. If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don't match, it starts probing.

    探测只是意味着它逐个槽搜索槽以找到一个空槽.从技术上讲,我们可以一个一个地进行,i+1, i+2, ... 并使用第一个可用的(即线性探测).但由于评论中解释得很漂亮的原因(见 dictobject.c:33-126),CPython 使用随机探测.在随机探测中,以伪随机顺序选取下一个时隙.条目被添加到第一个空槽.对于这个讨论,用于选择下一个插槽的实际算法并不重要(参见 dictobject.c:33-126 用于探测算法).重要的是在找到第一个空槽之前对槽进行探测.

    Probing just means it searches the slots by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the algorithm for probing). What is important is that the slots are probed until first empty slot is found.

    查找也会发生同样的事情,只是从初始槽 i 开始(其中 i 取决于键的散列).如果散列和键都与槽中的条目不匹配,则它开始探测,直到找到匹配的槽.如果所有槽都用完了,则报告失败.

    The same thing happens for lookups, just starts with the initial slot i (where i depends on the hash of the key). If the hash and the key both don't match the entry in the slot, it starts probing, until it finds a slot with a match. If all slots are exhausted, it reports a fail.

    顺便说一句,如果 dict 已满三分之二,则将调整其大小.这可以避免减慢查找速度.(参见 dictobject.h:64-65)

    BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups. (see dictobject.h:64-65)

    注意:我对 Python Dict 实现进行了研究,以回应我自己的问题,关于 dict 中的多个条目如何具有相同的哈希值.我在这里发布了稍微编辑过的回复版本,因为所有研究也与这个问题非常相关.

    NOTE: I did the research on Python Dict implementation in response to my own question about how multiple entries in a dict can have same hash values. I posted a slightly edited version of the response here because all the research is very relevant for this question as well.

    这篇关于Python 的内置字典是如何实现的?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

  • 查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆