Why is the order in dictionaries and sets arbitrary?


Question

I don't understand how looping over a dictionary or set in Python is done in 'arbitrary' order.

I mean, it's a programming language, so everything in the language must be 100% determined, correct? Python must have some kind of algorithm that decides which part of the dictionary or set is chosen first, second and so on.

What am I missing?

Solution

Note: This answer was written before the implementation of the dict type changed in Python 3.6. Most of the implementation details in this answer still apply, but the listing order of keys in dictionaries is no longer determined by hash values. The set implementation remains unchanged.

The order is not arbitrary, but depends on the insertion and deletion history of the dictionary or set, as well as on the specific Python implementation. For the remainder of this answer, for 'dictionary', you can also read 'set'; sets are implemented as dictionaries with just keys and no values.

Keys are hashed, and hash values are assigned to slots in a dynamic table (it can grow or shrink as needed). That mapping process can lead to collisions, meaning that a key has to be slotted into a different slot, depending on what is already in the table.

Listing the contents loops over the slots, and so keys are listed in the order they currently reside in the table.
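
As a rough illustration of the two paragraphs above, here is a minimal sketch of a slot table. The function name listing_order, the keys and their hash values are all made up for the example, and collisions are resolved with plain linear probing, which is a simplification and not what CPython actually does. It only aims to show that the listing order follows slot positions, and that the same keys can come out in a different order when the table has a different size.

```python
def listing_order(items, size):
    """Place (key, hash) pairs into a table of `size` slots, then list them slot by slot.

    Collisions use plain linear probing, a simplification chosen for this
    sketch; it is not CPython's actual probing scheme.
    """
    slots = [None] * size
    for key, hash_value in items:
        index = hash_value % size
        while slots[index] is not None:   # slot taken: try the next one
            index = (index + 1) % size
        slots[index] = key
    # Listing walks the slots from first to last and skips the empty ones.
    return [key for key in slots if key is not None]


# Hypothetical keys with made-up hash values.
items = [("a", 9), ("b", 2), ("c", 17)]

small = listing_order(items, 8)    # 'a' and 'c' collide on slot 1
large = listing_order(items, 32)   # no collisions in the bigger table
print(small)
print(large)
```

With these made-up values the 8-slot table lists a, b, c while the 32-slot table lists b, a, c, even though the insertion order was identical.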

Take the keys 'foo' and 'bar', for example, and let's assume the table size is 8 slots. In Python 2.7, hash('foo') is -4177197833195190597 and hash('bar') is 327024216814240868. Modulo 8, that means these two keys are slotted into slots 3 and 4:

>>> hash('foo')
-4177197833195190597
>>> hash('foo') % 8
3
>>> hash('bar')
327024216814240868
>>> hash('bar') % 8
4

This informs their listing order:

>>> {'bar': None, 'foo': None}
{'foo': None, 'bar': None}

All slots except 3 and 4 are empty. Looping over the table first lists slot 3, then slot 4, so 'foo' is listed before 'bar'.

'bar' and 'baz', however, have hash values that are exactly 8 apart and thus map to the exact same slot, 4:

>>> hash('bar')
327024216814240868
>>> hash('baz')
327024216814240876
>>> hash('bar') % 8
4
>>> hash('baz') % 8
4

Their order now depends on which key was slotted first; the second key has to be moved to a different slot:

>>> {'baz': None, 'bar': None}
{'bar': None, 'baz': None}
>>> {'bar': None, 'baz': None}
{'baz': None, 'bar': None}

The table order differs here, because one or the other key was slotted first.
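
The collision case can be replayed with a small simulation. This is only a sketch: the helper names insert and listing are invented here, the hash values are the Python 2.7 ones quoted in this answer, and the probing recurrence is modelled on my reading of CPython 2.7's dictobject.c (i = i*5 + perturb + 1, with perturb shifted right by 5 after each probe), so treat it as an approximation and not as the real implementation.

```python
MASK64 = (1 << 64) - 1   # treat hashes as unsigned 64-bit values, as the C code would
PERTURB_SHIFT = 5

# Python 2.7 hash values quoted in the answer (64-bit build).
HASHES = {
    "foo": -4177197833195190597,
    "bar": 327024216814240868,
    "baz": 327024216814240876,
}


def insert(slots, key):
    """Put `key` into the first free slot of its probe sequence."""
    mask = len(slots) - 1              # table size must be a power of two
    perturb = HASHES[key] & MASK64
    i = perturb & mask
    while slots[i & mask] is not None:
        # Assumed recurrence, modelled on CPython 2.7's probing loop.
        i = (i * 5 + perturb + 1) & MASK64
        perturb >>= PERTURB_SHIFT
    slots[i & mask] = key


def listing(keys, size=8):
    """Insert keys in the given order, then list the table slot by slot."""
    slots = [None] * size
    for key in keys:
        insert(slots, key)
    return [key for key in slots if key is not None]


print(listing(["bar", "foo"]))   # no collision: slot 3 comes before slot 4
print(listing(["baz", "bar"]))   # 'bar' collides with 'baz' and is moved
print(listing(["bar", "baz"]))   # 'baz' collides with 'bar' and is moved
```

Under these assumptions the second key of a colliding pair ends up in slot 1, ahead of slot 4, which is why the key inserted last is listed first in the two sessions above.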

The technical name for the underlying structure used by CPython (the most commonly used Python implementation) is a hash table, one that uses open addressing. If you are curious, and understand C well enough, take a look at the C implementation for all the (well-documented) details. You could also watch this PyCon 2010 presentation by Brandon Rhodes about how the CPython dict works, or pick up a copy of Beautiful Code, which includes a chapter on the implementation written by Andrew Kuchling.

Note that as of Python 3.3, a random hash seed is used as well, making hash collisions unpredictable to prevent certain types of denial of service (where an attacker renders a Python server unresponsive by causing mass hash collisions). This means that the order of a given dictionary or set is then also dependent on the random hash seed for the current Python invocation.
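
One way to observe the seed's effect is to start fresh interpreters with different values of the PYTHONHASHSEED environment variable. The helper name hash_in_fresh_interpreter below is made up for this sketch, and it assumes sys.executable points at a working Python 3 interpreter that can be launched as a subprocess.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, seed):
    """Return hash(text) as computed by a new Python process started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    output = subprocess.check_output(
        [sys.executable, "-c", "print(hash({!r}))".format(text)],
        env=env,
    )
    return int(output)


# Same seed, two separate processes: the string hash is reproducible.
same_seed = [hash_in_fresh_interpreter("foo", 1) for _ in range(2)]
# Different seed: the hash, and with it the slot the key lands in, changes.
other_seed = hash_in_fresh_interpreter("foo", 2)
print(same_seed, other_seed)
```

Fixing the seed like this can help when debugging order-dependent behaviour in sets, but code should not rely on it.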

Other implementations are free to use a different structure for dictionaries, as long as they satisfy the documented Python interface for them, but I believe that all implementations so far use a variation of the hash table.

CPython 3.6 introduced a new dict implementation that maintains insertion order, and is faster and more memory efficient to boot. Rather than keep a large sparse table where each row references the stored hash value and the key and value objects, the new implementation adds a smaller hash array that only references indices in a separate 'dense' table (one that only contains as many rows as there are actual key-value pairs), and it is the dense table that happens to list the contained items in order. See the proposal to Python-Dev for more details. Note that in Python 3.6 this is considered an implementation detail; Python-the-language does not specify that other implementations have to retain order. This changed in Python 3.7, where this detail was elevated to be part of the language specification; for any implementation to be properly compatible with Python 3.7 or newer, it must copy this order-preserving behaviour. And to be explicit: this change doesn't apply to sets, as sets already have a 'small' hash structure.
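
The split between a sparse index array and a dense entry table can be sketched in a few lines. The class name CompactDict and its layout are illustrative only: the sketch uses linear probing, never resizes and does not support deletion, all of which differ from the real CPython implementation.

```python
class CompactDict:
    """Toy model: a sparse index array pointing into a dense, insertion-ordered entry list."""

    def __init__(self, size=8):
        self.indices = [None] * size   # sparse: slot -> position in self.entries
        self.entries = []              # dense: (hash, key, value) in insertion order

    def _find_slot(self, key, hash_value):
        """Return the slot holding `key`, or the empty slot where it would go."""
        size = len(self.indices)
        slot = hash_value % size
        while True:
            position = self.indices[slot]
            if position is None or self.entries[position][1] == key:
                return slot
            slot = (slot + 1) % size   # linear probing (a simplification)

    def __setitem__(self, key, value):
        hash_value = hash(key)
        slot = self._find_slot(key, hash_value)
        position = self.indices[slot]
        if position is None:
            # Always keep one slot empty so that probing terminates.
            if len(self.entries) >= len(self.indices) - 1:
                raise RuntimeError("toy table is full; a real dict would resize here")
            self.indices[slot] = len(self.entries)
            self.entries.append((hash_value, key, value))
        else:
            # Existing key: update in place, its position in the order is unchanged.
            self.entries[position] = (hash_value, key, value)

    def __getitem__(self, key):
        position = self.indices[self._find_slot(key, hash(key))]
        if position is None:
            raise KeyError(key)
        return self.entries[position][2]

    def keys(self):
        # Listing walks the dense table, so hash values no longer affect the order.
        return [key for _, key, _ in self.entries]


d = CompactDict()
d["foo"] = 1
d["bar"] = 2
d["baz"] = 3
d["foo"] = 10        # updating a key keeps its original position
print(d.keys())
```

Whatever slots the hashes select in the index array, keys() reports foo, bar, baz, because only the dense table is consulted when listing.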

Python 2.7 and newer also provide an OrderedDict class, a subclass of dict that adds an additional data structure to record key order. At the price of some speed and extra memory, this class remembers in what order you inserted keys; listing keys, values or items will then do so in that order. It uses a doubly-linked list stored in an additional dictionary to keep the order up-to-date efficiently. See the post by Raymond Hettinger outlining the idea. OrderedDict objects have other advantages, such as being re-orderable.
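
A short example of the re-ordering support, assuming Python 3.2 or newer (move_to_end is not available on the Python 2.7 version of the class); the variable names are only for illustration:

```python
from collections import OrderedDict

d = OrderedDict()
d["foo"] = 1
d["bar"] = 2
d["baz"] = 3
insertion_order = list(d)            # keys in the order they were inserted

d.move_to_end("foo")                 # move an existing key to the end
after_move = list(d)

d.move_to_end("baz", last=False)     # or move one to the front
after_front = list(d)

oldest = d.popitem(last=False)       # remove and return the first item

print(insertion_order, after_move, after_front, oldest)
```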

If you want an ordered set, you can install the oset package; it works on Python 2.5 and up.
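
If installing a package is not an option, a plain dict can stand in for an ordered set on Python 3.7 or newer, where insertion order is guaranteed. This is a different technique from oset and does not use its API; the function name ordered_unique is made up for the sketch.

```python
def ordered_unique(items):
    """Return the distinct items in first-seen order, using dict keys as an ordered set."""
    return list(dict.fromkeys(items))


print(ordered_unique(["foo", "bar", "foo", "baz", "bar"]))
```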
