在Python中打破一个不可变的字典 [英] Hashing an immutable dictionary in Python

查看:168
本文介绍了在Python中打破一个不可变的字典的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

简短版本:作为无序项目字典实现的多集群最佳哈希算法是什么?

Short version: What's the best hashing algorithm for a multiset implemented as a dictionary of unordered items?

我正在尝试哈希一个不可变的multiset(它是其他语言中的一个bag或multiset:像一个数学集合,除了它可以容纳多个元素之一)实现为字典。我创建了标准库类 collections.Counter 的子类,类似于这里的建议: Python hashable dicts ,它推荐一个如下的哈希函数:

I'm trying to hash an immutable multiset (which is a bag or multiset in other languages: like a mathematical set except that it can hold more than one of each element) implemented as a dictionary. I've created a subclass of the standard library class collections.Counter, similar to the advice here: Python hashable dicts, which recommends a hash function like so:

class FrozenCounter(collections.Counter):
    # ...
    def __hash__(self):
        return hash(tuple(sorted(self.items())))

创建完整的项目元组占用大量内存(相对于使用生成器)和哈希将发生在我的应用程序的极度记忆密集的部分。更重要的是,我的字典键(multiset元素)可能无法订购。

Creating the full tuple of items takes up a lot of memory (relative to, say, using a generator) and hashing will occur in an extremely memory intensive part of my application. More importantly, my dictionary keys (multiset elements) probably won't be order-able.

我正在考虑使用这个算法:

I'm thinking of using this algorithm:

def __hash__(self):
    return functools.reduce(lambda a, b: a ^ b, self.items(), 0)

我使用按位XOR表示顺序不重要哈希值不同于元组的散列值?我想我可以在数据的无序元数据流中半实现Python元组哈希算法。请参阅 https://github.com/jonashaag/cpython/blob/master /Include/tupleobject.h (在页面中搜索哈希) - 但我几乎不知道C读取它。

I figure using bitwise XOR means order doesn't matter for the hash value unlike in the hashing of a tuple? I suppose I could semi-implement the Python tuple-hashing alogrithm on the unordered stream of tuples of my data. See https://github.com/jonashaag/cpython/blob/master/Include/tupleobject.h (search in the page for the word 'hash') -- but I barely know enough C to read it.

< STRONG>的思考?建议?谢谢。



如果你想知道为什么我在试图散列一个multiset的混乱:我的问题的输入数据是一组多重集,在每个多集合集合中,每个多集合必须是唯一的,我在一个截止日期工作,我不是一个经验丰富的编码器,所以我想避免在可能的情况下发明新的算法,似乎最多Pythonic的方式来确保我有一些独特的东西是把它们放在一个 set()中,但是这些东西必须是可以哈希的。)

Thoughts? Suggestions? Thanks.


(If you're wondering why I'm messing around with trying to hash a multiset: The input data for my problem are sets of multisets, and within each set of multisets, each multiset must be unique. I'm working on a deadline and I'm not an experienced coder, so I wanted to avoid inventing new algorithms where possible. It seems like the most Pythonic way to make sure I have unique of a bunch of things is to put them in a set(), but the things must be hashable.)

@marcin和@senderle都给了很多同样的答案:使用 hash(frozenset(self.items()))。这是有道理的,因为 items()views设置为。 @marcin是第一个,但由于对不同解决方案的大O运行时间进行了很好的研究,我给了@senderle的复选标记。 @marcin还提醒我包含一个 __ eq __ 方法 - 但是继承自 dict 的那个将会正常工作。这是我如何实现所有内容 - 基于此代码的进一步的意见和建议是受欢迎的:

Both @marcin and @senderle gave pretty much the same answer: use hash(frozenset(self.items())). This makes sense because items() "views" are set-like. @marcin was first but I gave the check mark to @senderle because of the good research on the big-O running times for different solutions. @marcin also reminds me to include an __eq__ method -- but the one inherited from dict will work just fine. This is how I'm implementing everything -- further comments and suggestions based on this code are welcome:

class FrozenCounter(collections.Counter):
    # Edit: A previous version of this code included a __slots__ definition.
    # But, from the Python documentation: "When inheriting from a class without
    # __slots__, the __dict__ attribute of that class will always be accessible,
    # so a __slots__ definition in the subclass is meaningless."
    # http://docs.python.org/py3k/reference/datamodel.html#notes-on-using-slots
    # ...
    def __hash__(self):
        "Implements hash(self) -> int"
        if not hasattr(self, '_hash'):
            self._hash = hash(frozenset(self.items()))
        return self._hash


推荐答案

由于字典是不可变的,您可以在创建字典时创建哈希,并直接返回。我的建议是创建一个 frozenset 项目(在2.7以上的3+; iteritems ),哈希它,并存储哈希。

Since the dictionary is immutable, you can create the hash when the dictionary is created and return it directly. My suggestion would be to create a frozenset from items (in 3+; iteritems in 2.7), hash it, and store the hash.

提供一个明确的例子:

>>>> frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems())
frozenset([(3, 2), (1, 3), (4, 1), (2, 1)])
>>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems()))
-3071743570178645657
>>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 4]).iteritems()))
-6559486438209652990

为了澄清为什么我更喜欢一个 frozenset 到一个已排序的项目的元组:a $ code> frozenset 不需要对项目进行排序(因为它们通过在内存中的哈希来稳定排序),因此初始哈希应该在O(n)时间内完成,而不是O(n log n)时间。这可以从 frozenset_hash set_next 实现。

To clarify why I prefer a frozenset to a tuple of sorted items: a frozenset doesn't have to sort the items (because they are stably ordered by their hash in memory), and so the initial hash should complete in O(n) time rather than O(n log n) time. This can be seen from the frozenset_hash and set_next implementations.

这篇关于在Python中打破一个不可变的字典的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆