计数Python字典中的冲突 [英] Counting collisions in a Python dictionary

查看:220
本文介绍了计数Python字典中的冲突的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我第一次在这里发帖,所以希望我以正确的方式问我的问题,

my first time posting here, so hope I've asked my question in the right sort of way,

在Python字典中添加元素后,是不是有可能让Python告诉你,添加该元素是否导致了冲突? (在找到一个放置元素的位置之前,碰撞解决策略探测了多少个位置?)

After adding an element to a Python dictionary, is it possible to get Python to tell you if adding that element caused a collision? (And how many locations the collision resolution strategy probed before finding a place to put the element?)

我的问题是:我正在使用字典作为一个较大的项目的一部分,经过广泛的分析,我发现代码中最慢的部分是处理使用字典实现的稀疏距离矩阵。

My problem is: I am using dictionaries as part of a larger project, and after extensive profiling, I have discovered that the slowest part of the code is dealing with a sparse distance matrix implemented using dictionaries.

我使用的密钥是Python对象是唯一的整数,所以我知道它们都是哈希到不同的值。但是将它们放在字典中,原则上仍然可能引起冲突。我不相信字典碰撞是减慢我的程序的事情,但是我想从我的查询中消除它们。

The keys I'm using are IDs of Python objects, which are unique integers, so I know they all hash to different values. But putting them in a dictionary could still cause collisions in principle. I don't believe that dictionary collisions are the thing that's slowing my program down, but I want to eliminate them from my enquiries.

所以,例如,给定以下字典:

So, for example, given the following dictionary:

d = {}
for i in xrange(15000):
    d[random.randint(15000000, 18000000)] = 0

你可以让Python告诉你创建时发生了多少冲突它是吗?

can you get Python to tell you how many collisions happened when creating it?

我的实际代码与应用程序纠缠在一起,但上述代码使得与我使用的字典非常相似的字典。

My actual code is tangled up with the application, but the above code makes a dictionary that looks very similar to the ones I am using.

重复:我不认为碰撞是减慢代码的速度,我只想通过显示我的字典没有多少可以消除这种可能性

To repeat: I don't think that collisions are what is slowing down my code, I just want to eliminate the possibility by showing that my dictionaries don't have many collisions.

感谢您的帮助。

编辑 @Winston Ewert的解决方案:

Some code to implement @Winston Ewert's solution:

n = 1500
global collision_count
collision_count = 0

class Foo():

    def __eq__(self, other):
        global collision_count
        collision_count += 1
        return id(self) == id(other)

    def __hash__(self):
        #return id(self) # @John Machin: yes, I know!
        return 1

objects = [Foo() for i in xrange(n)]

d = {}
for o in objects:
    d[o] = 1

print collision_count

请注意,当你在类上定义 __ eq __ ,如果您还没有定义<$ code> TypeError,则可以使用 TypeError:unhashable instance c $ c> __ hash __ function。

Note that when you define __eq__ on a class, Python gives you a TypeError: unhashable instance if you don't also define a __hash__ function.

它的运行不如我预期的那样。如果您有 __哈希__ 函数返回1 ,那么您会收到如下所示的冲突负载(1125560 n = 1500我的系统)。但是,使用 return id(self),有0个冲突。

It doesn't run quite as I expected. If you have the __hash__ function return 1, then you get loads of collisions, as expected (1125560 collisions for n=1500 on my system). But with return id(self), there are 0 collisions.

任何人都知道为什么这是说0碰撞?

Anyone know why this is saying 0 collisions?

编辑:
我可能已经弄清楚了。

I might have figured this out.

是因为 __ eq __ 仅在 __哈希__

Is it because __eq__ is only called if the __hash__ values of two objects are the same, not their "crunched version" (as @John Machin put it)?

推荐答案

strong>简短的答案:

Short answer:

通过使用随机整数作为dict键,您无法模拟使用对象ids作为dict键。他们有不同的散列函数。

You can't simulate using object ids as dict keys by using random integers as dict keys. They have different hash functions.

碰撞确实发生。

您不应该担心碰撞事件。

You shouldn't be worrying about collisions.

长回答

一些解释,源自阅读源代码

被实现为2 ** i条目的表,其中我是一个整数。

A dict is implemented as a table of 2 ** i entries, where i is an integer.

dicts不超过2/3满。因此,对于15000个密钥,我必须是15和2 **我是32768。

dicts are no more than 2/3 full. Consequently for 15000 keys, i must be 15 and 2 ** i is 32768.

当o是一个没有定义<$ c $的类的任意实例c> __ hash __()哈希(o)== id(o)不是。由于地址在低位3或4位可能具有零,所以通过将地址向右旋转4位来构造散列;请参阅源文件Objects / object.c ,函数 _Py_HashPointer

When o is an arbitrary instance of a class that doesn't define __hash__(), it is NOT true that hash(o) == id(o). As the address is likely to have zeroes in the low-order 3 or 4 bits, the hash is constructed by rotating the address right by 4 bits; see the source file Objects/object.c, function _Py_HashPointer

如果低位中有很多零,这将是一个问题,因为要访问大小为2 ** i(例如32768)的表,必须对散列值(通常大于该值)进行调整以适应,并且通过采用低阶i(例如15)位来完成非常简单和快速的完成的哈希值。

It would be a problem if there were lots of zeroes in the low-order bits, because to access a table of size 2 ** i (e.g. 32768), the hash value (often much larger than that) must be crunched to fit, and this is done very simply and quickly by taking the low order i (e.g. 15) bits of the hash value.

因此碰撞是不可避免的

但这是不会引起恐慌。散列值的剩余位被计入下一个探针将在哪里的计算。需要第三次探测的可能性应该相当小,特别是因为dict不超过2/3。通过计算第一个和后续探针的插槽的便宜成本,可以减轻多个探针的成本。

However this is not cause for panic. The remaining bits of the hash value are factored into the calculation of where the next probe will be. The likelihood of a 3rd etc probe being needed should be rather small, especially as the dict is never more than 2/3 full. The cost of multiple probes is mitigated by the cheap cost of calculating the slot for the first and subsequent probes.

下面的代码是一个简单的实验,说明了大部分上述讨论。它假定dict已经达到其最大大小后随机访问。使用Python2.7.1,它可以显示15000个对象的大约2000次冲突(13.3%)。

The code below is a simple experiment illustrating most of the above discussion. It presumes random accesses of the dict after it has reached its maximum size. With Python2.7.1, it shows about 2000 collisions for 15000 objects (13.3%).

无论如何,你应该把注意力转移到其他地方。碰撞不是你的问题,除非你已经获得了一些非常异常的方法来获取你的对象的记忆。你应该看看你如何使用这些例子。在d 中使用 k,或尝试/除外,而不是 d.has_key(k)。考虑以 d [(x,y)] 访问的一个dict,而不是以 d [x] [y] 。如果您需要帮助,请问一个单独的问题。

In any case the bottom line is that you should really divert your attention elsewhere. Collisions are not your problem unless you have achieved some extremely abnormal way of getting memory for your objects. You should look at how you are using the dicts e.g. use k in d or try/except, not d.has_key(k). Consider one dict accessed as d[(x, y)] instead of two levels accessed as d[x][y]. If you need help with that, ask a seperate question.

更新后,测试Python 2.6:

Update after testing on Python 2.6:

旋转地址直到Python 2.7才引入;有关综合讨论和基准测试,请参见此错误报告。基本结论是IMHO仍然有效,可以通过更新如果可以来增加。

Rotating the address was not introduced until Python 2.7; see this bug report for comprehensive discussion and benchmarks. The basic conclusions are IMHO still valid, and can be augmented by "Update if you can".

>>> n = 15000
>>> i = 0
>>> while 2 ** i / 1.5 < n:
...    i += 1
...
>>> print i, 2 ** i, int(2 ** i / 1.5)
15 32768 21845
>>> probe_mask = 2 ** i - 1
>>> print hex(probe_mask)
0x7fff
>>> class Foo(object):
...     pass
...
>>> olist = [Foo() for j in xrange(n)]
>>> hashes = [hash(o) for o in olist]
>>> print len(set(hashes))
15000
>>> probes = [h & probe_mask for h in hashes]
>>> print len(set(probes))
12997
>>>

这篇关于计数Python字典中的冲突的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆