Why do -1 and -2 both hash to -2 in CPython?


Question


Possible Duplicate:
When is a python object's hash computed and why is the hash of -1 different?

Why do -1 and -2 both hash to the same number in Python?

Since they do, how does Python tell these two numbers apart?

>>> -1 is -2
False
>>> hash(-1) is hash(-2)
True
>>> hash(-1)
-2
>>> hash(-2)
-2

Solution

-1 is a reserved value at the C level of CPython, which prevents hash functions from producing a hash value of -1. As noted by DSM, the same is not true in IronPython and PyPy, where hash(-1) != hash(-2).

See this Quora answer:

If you write a type in a C extension module and provide a tp_hash method, you have to avoid -1 — if you return -1, Python will assume you meant to throw an error.

If you write a class in pure Python and provide a __hash__ method, there's no such requirement, thankfully. But that's because the C code that invokes your __hash__ method does that for you — if your __hash__ returns -1, then hash() applied to your object will actually return -2.

That answer really just repackages the information from effbot:

The hash value -1 is reserved (it’s used to flag errors in the C implementation). If the hash algorithm generates this value, we simply use -2 instead.

You can also see this in the source. For example for Python 3’s int object, this is at the end of the hash implementation:

if (x == (Py_uhash_t)-1)
    x = (Py_uhash_t)-2;
return (Py_hash_t)x;
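
The same substitution can be observed from pure Python. The following is a minimal sketch, not part of the original answer: AlwaysMinusOne is a hypothetical class made up for illustration, and the -2 result is specific to CPython, so the script only reports what it sees rather than assuming it.

```python
import sys


class AlwaysMinusOne:
    """Hypothetical class whose __hash__ returns the reserved value -1."""

    def __hash__(self):
        return -1


on_cpython = sys.implementation.name == "cpython"
observed = hash(AlwaysMinusOne())

# On CPython the C code that calls __hash__ swaps -1 for -2;
# other implementations may hand back -1 unchanged.
print(sys.implementation.name, observed)
```

On CPython this should print -2 for the observed hash, matching hash(-1) and hash(-2) from the question.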


Since they do, how does Python tell these two numbers apart?

Since all hash functions map a large input space to a smaller output space, collisions are always expected, no matter how good the hash function is. Think of hashing strings, for example. If hash codes are 32-bit integers, you have 2^32 (a little more than 4 billion) hash codes. If you consider all ASCII strings of length 6, you have (2^7)^6 (just under 4.4 trillion) different items in your input space. With only this set, you are guaranteed to have many, many collisions no matter how good the hash function is. Add Unicode characters and strings of unlimited length to that!
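
To make the counting argument concrete, here is the arithmetic as a short sketch. The variable names are made up for illustration; the numbers are the ones quoted above.

```python
# Number of distinct 32-bit hash codes.
hash_codes = 2 ** 32

# Number of distinct 6-character strings over 7-bit ASCII.
ascii_strings_len_6 = (2 ** 7) ** 6

# By the pigeonhole principle, on average this many strings
# must share each hash code.
average_per_code = ascii_strings_len_6 // hash_codes

print(hash_codes)            # 4294967296
print(ascii_strings_len_6)   # 4398046511104
print(average_per_code)      # 1024
```

So even a perfectly uniform 32-bit hash would map about a thousand of these short strings to every hash code.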

Therefore, the hash code only hints at the location of an object; an equality test follows to check candidate keys. To implement a membership test in a hash-table set, the hash code gives you a "bucket" number in which to search for the value. However, all set items with the same hash code land in the same bucket, so you also need an equality test to distinguish between the candidates in that bucket.

This hash code and equality duality is hinted at in the CPython documentation on hashable objects. In other languages/frameworks, there is a guideline/rule that if you provide a custom hash code function, you must also provide a custom equality test (performed on the same fields as the hash code function).
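
The bucket-then-equality lookup described above can be sketched with a toy set. This is an illustration only: ToyHashSet is a hypothetical class, and it uses a simple list per bucket, which is an assumption made for readability and not a description of how CPython's built-in set is laid out internally.

```python
class ToyHashSet:
    """Toy hash set: the hash picks a bucket, equality picks the item."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, item):
        # The hash code only narrows the search down to one bucket.
        return self._buckets[hash(item) % len(self._buckets)]

    def add(self, item):
        bucket = self._bucket(item)
        # The equality test decides whether the item is already stored.
        if not any(item == existing for existing in bucket):
            bucket.append(item)

    def __contains__(self, item):
        return any(item == existing for existing in self._bucket(item))


toy = ToyHashSet()
toy.add(-1)
only_minus_one = (-1 in toy, -2 in toy)

toy.add(-2)
both_present = (-1 in toy, -2 in toy)

print(only_minus_one, both_present)
```

Even when -1 and -2 share a hash code and therefore a bucket, -1 == -2 is False, so the two values stay distinct members. The built-in set and dict rely on the same hash-then-equality idea.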


Indeed, today's Python release addresses exactly this, with a security patch for the efficiency issue that arises when this (identical hash values, but on a massive scale) is used as a denial-of-service attack: http://mail.python.org/pipermail/python-list/2012-April/1290792.html
