为什么HashSet< T>类不用于实现Enumerable.Distinct [英] Why HashSet<T> class is not used to implement Enumerable.Distinct

查看:75
本文介绍了为什么HashSet< T>类不用于实现Enumerable.Distinct的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要以大O表示法访问IEnumerable.Distinct的渐近时间和空间复杂度

I needed to access the asymptotic time and space complexity of the IEnumerable.Distinct in big O notation

所以我正在研究扩展方法 Set<T> ,这几乎是具有开放地址"的哈希表的经典实现

So I was looking at the implementation of extension method Enumerable.Distinct and I see it is implemented using and internal class Set<T>, which is almost a classical implementation of a hash table with "open addressing"

很快引起注意的是 HashSet<T> ,但有一些遗漏

What quickly catches the eye is that a lot of code in Set<T> is just a copy-paste from HashSet<T>, with some omissions

但是,此简化的 Set<T> 实现存在一些明显的缺陷,例如 Resize 方法不使用素数作为插槽的大小,例如

However, this simplified Set<T> implementation has some obvious flaws, for example the Resize method not using prime numbers for the size of the slots, like HashSet<T> does, see HashHelpers.ExpandPrime

所以,我的问题是:

  1. 此处重复代码的原因是什么,为什么不坚持 DRY 原则?特别是考虑到这两个类都在同一程序集中System.Core
  2. 它看起来像 HashSet<T> 的性能会更好,因此我应该避免使用Distinct扩展方法,而编写自己的扩展方法,该方法将使用
  1. What is the reason for code duplication here, why not stick with DRY principle? Especially given the fact that both of these classes are in the same assembly System.Core
  2. It looks like HashSet<T> will perform better, so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>?

推荐答案

这几乎是带有开放地址"的哈希表的经典实现

which is almost a classical implementation of a hash table with "open addressing"

再看一次.它与列表头单元格分开链接.虽然所有插槽都在一个阵列中,但是在发生冲突的情况下查找下一个插槽是通过检查当前插槽的next字段来完成的.与在每个节点上将链表用作单独的堆对象相比,此方法具有更好的缓存效率,尽管在这方面不如开放式寻址那样好.同时,它避免了某些情况下开放式寻址的效果不佳.

Look again. It's separate chaining with list head cells. While the slots are all in an array, finding the next slot in the case of collision is done by examining the next field of the current slot. This has better cache efficiency than using linked lists with each node as a separate heap object, though not as good as open addressing in that regard. At the same time, it avoids some of the cases where open addressing does poorly.

Set中的许多代码只是HashSet中的复制粘贴,有一些遗漏

a lot of code in Set is just a copy-paste from HashSet, with some omissions

AFAICT使用哈希集的私有实现的原因是EnumerableHashSet大约同时独立开发.这只是我的推测,但是它们都是在.NET 3.5中引入的,因此是可行的.

AFAICT the reason a private implementation of a hash-set was used is that Enumerable and HashSet were developed independently at about the same time. That's just conjecture on my part, but they were both introduced with .NET 3.5 so it's feasible.

HashSet<T>很可能是通过复制Set<T>开始的,然后使其更好地服务于公开展示,尽管这两者也有可能都基于与列表头单元分开链接的相同原理

It's quite possible that HashSet<T> started by copying Set<T> and then making it better serve being exposed publicly, though it's also possible that the two were both based on the same principle of separate chaining with list head cells

在性能方面,HashSet使用质数表示它更有可能避免与较差的哈希值发生冲突(但是,究竟有多少优势,这不是一个简单的问题),但是Set在有很多方法,特别是在.NET Core中,其中不需要的一些东西已被删除.特别地,该版本的Set利用了以下事实:一旦删除了某个项目(例如在Intersect期间发生),就永远不会添加任何项目,这使它可以省去freelist和与之相关的任何工作,HashSet无法做.即使最初的实现也很轻松,因为它不跟踪版本以在枚举过程中捕获更改,这虽然花费很小,但是却为每次添加和删除操作都付出了代价.

In terms of performance, HashSet's using prime numbers means its more likely to avoid collisions with poor hashes (but just how much an advantage that is, is not a simple question), but Set is lighter in a lot of ways, especially in .NET Core where some things it doesn't need were removed. In particular, that version of Set takes advantage of the fact that once an item is removed (which happens, for example, during Intersect) there will never be an item added, which allows it to leave out freelist and any work related to it, which HashSet couldn't do. Even the initial implementation is lighter in not tracking a version to catch changes during enumeration, which is a small cost, but a cost to every addition and removal nevertheless.

同样地,对于具有不同哈希码分布的不同数据集,有时一个性能更好,有时另一个更好.

As such, with different sets of data with different distributions of hash codes sometimes one performs better, sometimes the other.

特别是考虑到这两个类都在同一程序集中System.Core中

Especially given the fact that both of these classes are in the same assembly System.Core

仅在某些.NET版本中,在某些版本中它们处于单独的程序集中.在.NET Core中,我们有两个版本的Set<T>,一个版本的程序集具有System.Linq,另一个版本的程序集具有System.Linq.Expressions.前者如上所述进行了修整,后者被HashSet<T>取代,因为它在该处的作用较小.

Only in some versions of .NET, in some they're in separate assemblies. In .NET Core we had two versions of Set<T>, one in the assembly that has System.Linq and one in the separate assembly that has System.Linq.Expressions. The former got trimmed down as described above, the latter replaced with a use of HashSet<T> as it was doing less there.

当然,System.Core排在第一位,但是这些元素可以完全分开的事实说明System.Core并不是相互依赖的单一整体.

Of course System.Core came first, but the fact that those elements could be separated out at all speaks of System.Core not being a single monolithic blob of inter-dependencies.

.NET Core的Linq版本中现在有一个ToHashSet()方法,这使得用HashSet<T>替换Set<T>的可能性更加合理,尽管这很容易.我认为@ james-ko正在考虑测试这样做的好处.

That there is now a ToHashSet() method in .NET Core's version of Linq makes the possibility of replacing Set<T> with HashSet<T> more justifiable, though not a no-brainer. I think @james-ko was considering testing the benefits of doing that.

HashSet<T>看起来会更好

由于上述原因,可能并非如此,尽管确实如此,这取决于源数据.在此之前,没有考虑过几种不同的linq方法的优化(在linq的初始版本中并不多,但在.NET Core中很少.)

For the reasons explained above, that might not be the case, though it might indeed, depending on source data. That's before getting into considerations of optimisations that go across a few different linq methods (not many in the initial versions of linq, but a good few in .NET Core).

所以我应该避免使用Distinct扩展方法,而编写自己的扩展方法,该方法将使用HashSet<T>而不是Set<T>.

so should I avoid using Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>.

使用Distinct().如果您遇到瓶颈,则可能是HashSet<T>将以给定的数据集获胜,但是如果您尝试确保配置文件与实际值紧密匹配,则代码将在现实生活中遇到.毫无疑问,如果您的应用程序遇到了另一种方法更好的情况,那么就可以基于某些任意测试来确定一种方法是更快的方法. (并且,如果我发现了这个问题点,那么我将首先考虑是否可以提高所讨论类型的GetHashCode()的速度或位分配).

Use Distinct(). If you've a bottle neck then it might be that HashSet<T> will win with a given data-set, but if you do try that make sure your profiling closely matches real values your code will encounter in real life. There's no point deciding one approach is the faster based on some arbitrary tests if your application hits cases where the other does better. (And if I was finding this a problem spot, I'd take a look at whether the GetHashCode() of the types in question could be improved for either speed or distribution of bits, first).

这篇关于为什么HashSet&lt; T&gt;类不用于实现Enumerable.Distinct的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆