How can hashset.contains be O(1) with this implementation?


Question

The HashSet.Contains implementation in .NET is:

    /// <summary>
    /// Checks if this hashset contains the item
    /// </summary>
    /// <param name="item">item to check for containment</param>
    /// <returns>true if item contained; false if not</returns>
    public bool Contains(T item) {
        if (m_buckets != null) {
            int hashCode = InternalGetHashCode(item);
            // see note at "HashSet" level describing why "- 1" appears in for loop
            for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
                if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, item)) {
                    return true;
                }
            }
        }
        // either m_buckets is null or wasn't found
        return false;
    }

I've read in a lot of places that "search complexity in a hashset is O(1)". How? Then why does that for-loop exist?

.NET reference source: https://github.com/microsoft/referencesource/blob/master/System.Core/System/Collections/Generic/HashSet.cs

Answer

The classic implementation of a hash table works by assigning elements to one of a number of buckets, based on the hash of the element. If the hashing were perfect, i.e. no two elements ever had the same hash, we'd be living in a perfect world where we wouldn't need to care about anything: any lookup would always be O(1), because we'd only need to compute the hash, find the bucket, and check whether something is inside.

We're not living in a perfect world, though. First off, consider string hashing. In .NET, there are (2^16)^n possible strings of length n; GetHashCode returns an int, and there are 2^32 possible values of int. That's exactly enough to hash every string of length 2 to a unique int, but as soon as we allow longer strings, there must exist two different values that give the same hash - this is called a collision. Besides, we don't want to maintain 2^32 buckets at all times anyway. The usual way of dealing with both problems is to take the hash code and compute its value modulo the number of buckets to determine the bucket's number [1]. So the takeaway is: we need to allow for collisions.
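
To make this concrete, here is a minimal sketch (mine, not the framework's) of the hash-to-bucket mapping; the bucket count of 7 and the sample strings are arbitrary choices for illustration:

    using System;

    class BucketDemo
    {
        static void Main()
        {
            const int bucketCount = 7; // arbitrary for illustration

            foreach (var s in new[] { "apple", "banana", "cherry" })
            {
                // Clear the sign bit so the modulo result is non-negative;
                // the InternalGetHashCode in the quoted source does the same
                // by masking with 0x7FFFFFFF.
                int hash = s.GetHashCode() & 0x7FFFFFFF;
                int bucket = hash % bucketCount;
                Console.WriteLine($"{s} -> hash {hash} -> bucket {bucket}");
            }
        }
    }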

The referenced .NET Framework implementation uses the simplest way of dealing with collisions: every bucket holds a chain (conceptually a linked list) of all objects that land in it. You add object A; it's assigned to bucket i. You add object B; it has the same hash, so it's added to the chain in bucket i right after A. Now, if you look up some element, you need to traverse the chain and call the actual Equals method to find out whether an entry really is the one you're looking for. That explains the for loop: in the worst case you have to walk the whole chain.
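
As an illustration, here is a stripped-down chained set - a sketch of the idea, not the real HashSet<T> internals (those chain through a slot array via next indices rather than allocating a list per bucket). The traversal in Contains plays the role of the quoted for loop:

    using System;
    using System.Collections.Generic;

    class ChainedSet<T>
    {
        private readonly List<T>[] _buckets;
        private readonly IEqualityComparer<T> _comparer = EqualityComparer<T>.Default;

        public ChainedSet(int bucketCount)
        {
            _buckets = new List<T>[bucketCount];
        }

        // Same hash -> bucket mapping as in the previous snippet.
        private int BucketOf(T item) =>
            (_comparer.GetHashCode(item) & 0x7FFFFFFF) % _buckets.Length;

        public bool Add(T item)
        {
            if (Contains(item)) return false;          // sets hold no duplicates
            int b = BucketOf(item);
            (_buckets[b] ??= new List<T>()).Add(item); // append to the chain
            return true;
        }

        public bool Contains(T item)
        {
            List<T> chain = _buckets[BucketOf(item)];
            if (chain == null) return false;
            // Walk the chain, confirming candidates with Equals - the
            // analogue of the for loop in the quoted Contains.
            foreach (T candidate in chain)
                if (_comparer.Equals(candidate, item))
                    return true;
            return false;
        }
    }

With a uniform hash and enough buckets, each chain stays short, so that loop only runs a handful of times per lookup.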

Okay, so what about "search complexity in hashset is O(1)"? Strictly, it isn't. The worst-case complexity is proportional to the number of items; it's O(1) on average [2]. If all objects fall into the same bucket, asking for an element at the end of the chain (or for one that is not in the structure but would fall into that bucket) is O(n).
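
You can force this degenerate behaviour on a real HashSet<string> by plugging in a comparer whose GetHashCode is constant, so that every item lands in the same bucket. The following is a deliberately pathological demonstration of my own, not something to ship:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    class ConstantHashComparer : IEqualityComparer<string>
    {
        public bool Equals(string x, string y) => string.Equals(x, y);
        public int GetHashCode(string s) => 0; // every item collides
    }

    class WorstCaseDemo
    {
        static void Main()
        {
            var oneBucket = new HashSet<string>(new ConstantHashComparer());
            var manyBuckets = new HashSet<string>();
            for (int i = 0; i < 10_000; i++)
            {
                oneBucket.Add("item" + i); // each Add already scans the whole chain
                manyBuckets.Add("item" + i);
            }

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1_000; i++) oneBucket.Contains("missing");   // O(n) each
            Console.WriteLine($"one bucket:   {sw.ElapsedMilliseconds} ms");

            sw.Restart();
            for (int i = 0; i < 1_000; i++) manyBuckets.Contains("missing"); // O(1) on average
            Console.WriteLine($"many buckets: {sw.ElapsedMilliseconds} ms");
        }
    }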

So what do people mean by "it's O(1) on average"? The structure monitors the ratio of stored objects to buckets - the load factor - and when it exceeds some threshold, it resizes: it allocates a larger bucket array and redistributes the items. It's easy to see that this keeps the average chain short, making the average lookup time proportional to the load factor, which is bounded by a constant.
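
The sketch below shows that policy under the same chaining scheme as before; the threshold of 1.0 and the doubling growth are illustrative choices (the reference implementation grows the table to a larger prime instead):

    using System;
    using System.Collections.Generic;

    class ResizeDemo
    {
        const double MaxLoadFactor = 1.0; // illustrative threshold

        static void Main()
        {
            var buckets = new List<int>[4];
            int count = 0;

            for (int value = 0; value < 100; value++)
            {
                // Resize before the load factor (count / buckets) passes the threshold.
                if (count + 1 > buckets.Length * MaxLoadFactor)
                    buckets = Rehash(buckets, buckets.Length * 2);

                int b = (value.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
                (buckets[b] ??= new List<int>()).Add(value);
                count++;
            }

            Console.WriteLine($"{count} items in {buckets.Length} buckets");
        }

        // Every item must be re-bucketed, because the bucket index
        // depends on the bucket count (hash % length).
        static List<int>[] Rehash(List<int>[] old, int newSize)
        {
            var fresh = new List<int>[newSize];
            foreach (var chain in old)
            {
                if (chain == null) continue;
                foreach (int item in chain)
                {
                    int b = (item.GetHashCode() & 0x7FFFFFFF) % newSize;
                    (fresh[b] ??= new List<int>()).Add(item);
                }
            }
            return fresh;
        }
    }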

That's why it's important for hash functions to be uniform, meaning that the probability that two randomly chosen distinct objects get assigned the same int is 1/2^32 [3]. This keeps the distribution of objects across the buckets of a hash table even, so we avoid pathological cases where one bucket holds a huge number of items.
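
A quick empirical check of that uniformity is to hash many distinct strings into a fixed number of buckets and compare the average and maximum chain lengths; the key names and counts here are arbitrary:

    using System;
    using System.Linq;

    class UniformityDemo
    {
        static void Main()
        {
            const int bucketCount = 1024;
            var sizes = new int[bucketCount];

            for (int i = 0; i < 100_000; i++)
            {
                int hash = ("key" + i).GetHashCode() & 0x7FFFFFFF;
                sizes[hash % bucketCount]++;
            }

            // A uniform hash gives roughly 98 items per bucket and a modest
            // maximum; a bad hash lets one bucket hoard most of the items.
            Console.WriteLine($"average: {sizes.Average():F1}, max: {sizes.Max()}");
        }
    }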

Note that if you know the hash function and the algorithm used by the hash table, you can force such a pathological case and O(n) lookups. If a server takes inputs from users and stores them in a hash table, an attacker who knows the hash function and the hash table implementation could use this as a vector for a denial-of-service attack. There are ways of dealing with that too, such as randomising the hash function per process. Treat this as a demonstration that yes, the worst case can be O(n), and that people are generally aware of that.

There are dozens of other, more complicated ways hash tables can be implemented; if you're interested, you'll have to research them on your own. Since lookup structures are so commonplace in computer science, people have come up with all sorts of crazy optimisations that minimise not only the theoretical number of operations but also things like CPU cache misses.

[1] Which is exactly what the statement int i = m_buckets[hashCode % m_buckets.Length] - 1 in the quoted implementation does.

[2] At least hash tables that use naive chaining are not. There exist hash tables with worst-case constant time complexity, but they're usually worse in practice than the theoretically slower (in terms of time complexity) implementations, mainly due to CPU cache misses.

[3] I'm assuming the domain of possible hashes is the set of all ints, so there are 2^32 of them, but everything written here generalises to any other non-empty, finite set of values.

