How can hashset.contains be O(1) with this implementation?

Question

The HashSet<T>.Contains implementation in .NET is:

    /// <summary>
    /// Checks if this hashset contains the item
    /// </summary>
    /// <param name="item">item to check for containment</param>
    /// <returns>true if item contained; false if not</returns>
    public bool Contains(T item) {
        if (m_buckets != null) {
            int hashCode = InternalGetHashCode(item);
            // see note at "HashSet" level describing why "- 1" appears in for loop
            for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
                if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, item)) {
                    return true;
                }
            }
        }
        // either m_buckets is null or wasn't found
        return false;
    }

I've read in a lot of places that "search complexity in a hashset is O(1)". How? And if that's true, why does that for loop exist?

Answer

The classic implementation of a hash table works by assigning elements to one of a number of buckets, based on the hash of the element. If the hashing were perfect, i.e. no two elements had the same hash, we'd be living in a perfectly perfect world where we wouldn't need to care about anything - any lookup would always be O(1), because we'd only need to compute the hash, get the bucket, and see whether anything is inside.

We're not living in a perfectly perfect world. First off, consider string hashing. In .NET, there are (2^16)^n possible strings of length n; GetHashCode returns an int, and an int has only 2^32 possible values. That's exactly enough to hash every string of length 2 to a unique int, but for any longer strings there must exist two different values that give the same hash - this is called a collision. Also, we don't want to maintain 2^32 buckets at all times anyway. The usual way of dealing with that is to take the hash code and compute its value modulo the number of buckets to determine the bucket's number [1]. So, the takeaway is - we need to allow for collisions.
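
For example (my own snippet, not library code), here is how a 32-bit hash code gets reduced to a bucket index:

    using System;

    class ModuloDemo {
        static void Main() {
            int hashCode = "hello".GetHashCode();
            int bucketCount = 16;
            // Mask the sign bit so the result of % is non-negative, then
            // fold the 2^32-sized hash space down to bucketCount buckets.
            // Two different hashes can still land in the same bucket.
            int bucket = (hashCode & 0x7FFFFFFF) % bucketCount;
            Console.WriteLine("hash " + hashCode + " lands in bucket " + bucket);
        }
    }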

The referenced .NET Framework implementation uses the simplest way of dealing with collisions - every bucket effectively holds a linked list of all objects that hash to it. You add object A; it's assigned to bucket i. You add object B; it has the same hash, so it's added to the list in bucket i, right after A. Now if you look up any element, you need to traverse the list of all objects in its bucket and call the actual Equals method to find out whether that thing is really the one you're looking for. That explains the for loop - in the worst case you have to go through the entire list.
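
To make that concrete, here's a minimal chaining sketch of my own (simplified: the real implementation chains through a slot array via next indices, as in the code above, rather than through List<T> objects):

    using System.Collections.Generic;

    class ChainedSet<T> {
        // One chain per bucket; null means the bucket is empty.
        private List<T>[] m_buckets = new List<T>[16];
        private readonly IEqualityComparer<T> m_comparer = EqualityComparer<T>.Default;

        private int GetBucket(T item) {
            // Mask the sign bit so the modulo result is non-negative.
            return (m_comparer.GetHashCode(item) & 0x7FFFFFFF) % m_buckets.Length;
        }

        public void Add(T item) {
            if (Contains(item)) return;   // set semantics: no duplicates
            int bucket = GetBucket(item);
            if (m_buckets[bucket] == null) m_buckets[bucket] = new List<T>();
            m_buckets[bucket].Add(item);
        }

        public bool Contains(T item) {
            List<T> chain = m_buckets[GetBucket(item)];
            if (chain == null) return false;
            // The analogue of the for loop in Contains above: every element
            // that hashed to this bucket must be checked with Equals.
            foreach (T candidate in chain) {
                if (m_comparer.Equals(candidate, item)) return true;
            }
            return false;
        }
    }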

Okay, so what about "search complexity in hashset is O(1)"? Strictly speaking, it's not. The worst-case complexity is proportional to the number of items. It's O(1) on average [2]. If all objects fall into the same bucket, asking for an element at the end of the list (or for one that is not in the structure but would fall into that same bucket) is O(n).
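
You can see this for yourself with a deliberately terrible hash function (a contrived example of mine, not anything from the framework): every instance reports the same hash, so all of them land in one bucket and Contains degenerates into a linear scan.

    using System.Collections.Generic;

    class BadKey {
        public int Value;
        // Every instance collides on purpose.
        public override int GetHashCode() { return 42; }
        public override bool Equals(object obj) {
            return obj is BadKey other && other.Value == Value;
        }
    }

    class Program {
        static void Main() {
            var set = new HashSet<BadKey>();
            for (int i = 0; i < 10000; i++) {
                set.Add(new BadKey { Value = i });   // one ever-growing chain
            }
            // A miss (or a late hit) now walks the whole 10,000-element chain.
            set.Contains(new BadKey { Value = -1 });
        }
    }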

So what do people mean by "it's O(1) on average"? The structure monitors how many objects it holds in proportion to the number of buckets, and if that ratio exceeds a certain threshold, called the load factor, it resizes, allocating more buckets and redistributing the elements. It's easy to see that this keeps the average lookup time proportional to the load factor - and since the load factor is a constant, the average lookup is O(1).
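
Extending the ChainedSet sketch above with a hypothetical resize (the 1.0 threshold and the doubling are illustrative choices of mine, not what the BCL does):

    // Added to the ChainedSet sketch above. Add() would also need to
    // increment m_count and then call GrowIfNeeded().
    private int m_count;
    private const double LoadFactor = 1.0;   // illustrative threshold

    private void GrowIfNeeded() {
        if ((double)m_count / m_buckets.Length <= LoadFactor) return;
        List<T>[] old = m_buckets;
        m_buckets = new List<T>[old.Length * 2];
        foreach (List<T> chain in old) {
            if (chain == null) continue;
            foreach (T item in chain) {
                // Rehash: the bucket index depends on the array length,
                // so elements may move to different buckets.
                int bucket = GetBucket(item);
                if (m_buckets[bucket] == null) m_buckets[bucket] = new List<T>();
                m_buckets[bucket].Add(item);
            }
        }
    }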

That's why it's important for hash functions to be uniform, meaning that the probability that two randomly chosen different objects are assigned the same hash is 1/2^32 [3]. That keeps the distribution of objects across the buckets of a hash table uniform, so we avoid the pathological case where one bucket contains a huge number of items.

Note that if you know the hash function and the algorithm used by the hash table, you can force such a pathological case and O(n) lookups. If a server takes inputs from users and stores them in a hash table, an attacker who knows the hash function and the hash table implementation could use this as a vector for a denial-of-service attack. There are ways of dealing with that too. Treat this as a demonstration that yes, the worst case can be O(n), and that people are generally aware of that.
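
One common mitigation is to mix a per-process random seed into the hash, so an attacker can't precompute colliding inputs. Here's a sketch with a toy polynomial hash of my own (.NET's actual randomized string hashing uses a different algorithm):

    using System;

    static class SeededHash {
        // Chosen once per process; different on every run, so colliding
        // inputs computed offline are useless against a live server.
        private static readonly int Seed = new Random().Next();

        public static int Compute(string s) {
            int h = Seed;
            foreach (char c in s) {
                h = unchecked(h * 31 + c);   // toy polynomial hash over the seed
            }
            return h;
        }
    }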

There are dozens of other, more complicated ways hash tables can be implemented. If you're interested, you'll have to research them on your own. Since lookup structures are so commonplace in computer science, people have come up with all sorts of crazy optimisations that minimise not only the theoretical number of operations but also things like CPU cache misses.

[1] That's exactly what the statement int i = m_buckets[hashCode % m_buckets.Length] - 1 does.

[2] At least the ones using naive chaining are not. Hash tables with worst-case constant-time lookups do exist, but they're usually worse in practice than the theoretically (in terms of time complexity) slower implementations, mainly due to CPU cache misses.

[3] I'm assuming the domain of possible hashes is the set of all ints, so there are 2^32 of them, but everything I wrote generalises to any other non-empty, finite set of values.
