System.Collections.Generic.Dictionary = Ultimate performance?


Problem description



I'm writing a haXe C# target, and I've been studying the performance characteristics of haXe's std library so we can provide the best possible performance through its cross-platform code.

One very good example is the hash table code. I was a little reluctant to use .NET's Dictionary, as it seems bulky (the key/value pair structs can take up a huge amount of memory because of memory alignment, besides the unnecessary information they hold), and since the std library has no notion of an object hash, I really thought I could squeeze out a little performance by not having to call GetHashCode and by inlining the hash all along.
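To illustrate that point (my sketch, not code from the question): for an int-keyed table the key itself can serve as its own hash, so a bucket index reduces to a masked modulo with no virtual call:

// Sketch (my illustration, not code from the question): for int keys the key
// itself can serve as the hash, so computing a bucket index needs no
// GetHashCode() call at all.
static class IntHashing
{
    public static int BucketIndex(int key, int bucketCount)
        => (key & 0x7FFFFFFF) % bucketCount;   // mask clears the sign bit
}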

Also it's clear that the Dictionary implementation uses a linked list to deal with collisions, which is far from ideal.

So we started to implement our own solution, beginning with IntHash (Dictionary). We first implemented Hopscotch hashing, but it really didn't turn out very well; it was also fairly obvious that it wouldn't handle very large hash tables well, since H is normally a machine word, and as H / Length increases, performance gets worse.

We then moved on to implement a khash-inspired algorithm. This one had a lot of potential, as its benchmarks are impressive and it handles collisions in the same array. It also had some great properties, like resizing without needing twice as much memory as we would.

The benchmarks were disappointing. Needless to say, memory usage was much lower in our implementation than in Dictionary. But I was hoping to get a nice performance boost as well, and unfortunately that was not the case. It wasn't too far behind - less than an order of magnitude - but for both sets and gets, .NET's implementation still performed better.
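For context, a set/get comparison along these lines is roughly what such a micro-benchmark looks like (a rough sketch only; the question's actual benchmark code isn't shown, and the item count is an arbitrary choice of mine):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class DictionaryBenchmark
{
    static void Main()
    {
        const int N = 10_000_000;
        var dict = new Dictionary<int, int>(N);   // pre-sized to keep the comparison fair
        var sw = Stopwatch.StartNew();

        for (int i = 0; i < N; i++)               // "sets"
            dict[i * 31] = i;
        Console.WriteLine($"set: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        long sum = 0;
        for (int i = 0; i < N; i++)               // "gets"
            sum += dict[i * 31];
        Console.WriteLine($"get: {sw.ElapsedMilliseconds} ms (checksum {sum})");
    }
}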

So my question is: is that the best we have for C#? I tried looking for custom solutions, and it seems there are almost none. There is the C5 generic collection library, but the code is so cluttered I didn't even test it, and I found no benchmarks for it either.

So... Is that it? Should I just wrap around Dictionary<>?
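If wrapping is the answer, a thin IntHash-style wrapper could look roughly like this (a sketch only; the method names mirror what I'd expect from haXe's API and are my assumption, not the real std signatures):

using System.Collections.Generic;

// Sketch of a thin IntHash-style wrapper over Dictionary<int, T>.
// Method names are illustrative, not the actual haXe std API.
public sealed class IntHash<T>
{
    private readonly Dictionary<int, T> _inner = new Dictionary<int, T>();

    public void Set(int key, T value) => _inner[key] = value;

    public T Get(int key)
    {
        T value;
        return _inner.TryGetValue(key, out value) ? value : default(T);
    }

    public bool Exists(int key) => _inner.ContainsKey(key);

    public bool Remove(int key) => _inner.Remove(key);
}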

Thanks!!!

Solution

I've found that the .NET Dictionary performs well, if not exceptionally well, in most situations. It's a good general purpose implementation. The problem I most often run into is the 2-gigabyte limit. On a 64-bit system, you can't add more than about 89.5 million items to a dictionary (when the key is an integer or a reference, and the value is a reference). Dictionary overhead appears to be 24 bytes per item.
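For reference, that figure lines up with simple arithmetic: 2 GiB / 24 bytes per entry = 2,147,483,648 / 24 ≈ 89.5 million entries, which is where the observed ceiling comes from.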

That limit makes itself known in a very odd way. The Dictionary seems to grow by doubling: when it gets full, it increases capacity to the next prime number that's at least double the current size. Because of that, the dictionary will grow to about 47 million and then throw an exception, because when it tries to double (to 94 million), the memory allocation fails (due to the 2 gigabyte limit). I get around the problem by pre-allocating the Dictionary (i.e. by calling the constructor that lets you specify the capacity). That also speeds up populating the dictionary because it never has to grow, which entails allocating a new array and re-hashing everything.
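As a minimal sketch of that workaround (the capacity of 50 million is only an illustrative figure, not a number from the answer):

using System.Collections.Generic;

class PreallocatedDictionaryDemo
{
    static void Main()
    {
        const int Capacity = 50_000_000;   // illustrative size only

        // Passing the capacity up front means the dictionary never has to grow,
        // so there is no intermediate allocation or re-hashing while filling it.
        var map = new Dictionary<int, object>(Capacity);

        for (int i = 0; i < Capacity; i++)
            map[i] = null;
    }
}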

What makes you say that Dictionary uses a linked list for collision resolution? I'm pretty sure it uses open addressing, but I don't know how it does the probes. I guess if it does linear probing, then the effect is similar to what you'd get with a linked list.

We wrote our own BigDictionary class to get past the 2-gigabyte limit and found that a straightforward open addressing scheme with linear probing gives reasonably good performance. It's not as fast as Dictionary, but it can handle hundreds of millions of items (billions if I had the memory).
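The answer doesn't show BigDictionary itself, so the following is only a minimal sketch of the kind of open-addressing, linear-probing table it describes (the class name, the int-key specialization, and the lack of resizing and deletion are all my simplifications):

using System;

// Minimal sketch of an open-addressing hash table with linear probing,
// specialized for int keys. Not the answer's BigDictionary, just an
// illustration of the scheme it describes. Assumes the table is never
// filled to capacity; no deletion or resizing.
public sealed class LinearProbingIntMap<TValue>
{
    private readonly int[] _keys;
    private readonly TValue[] _values;
    private readonly bool[] _occupied;
    private readonly int _capacity;

    public LinearProbingIntMap(int capacity)
    {
        _capacity = capacity;
        _keys = new int[capacity];
        _values = new TValue[capacity];
        _occupied = new bool[capacity];
    }

    public void Set(int key, TValue value)
    {
        int i = (key & 0x7FFFFFFF) % _capacity;
        while (_occupied[i] && _keys[i] != key)
            i = (i + 1) % _capacity;          // linear probe to the next slot
        _keys[i] = key;
        _values[i] = value;
        _occupied[i] = true;
    }

    public bool TryGet(int key, out TValue value)
    {
        int i = (key & 0x7FFFFFFF) % _capacity;
        while (_occupied[i])
        {
            if (_keys[i] == key) { value = _values[i]; return true; }
            i = (i + 1) % _capacity;          // keep probing until an empty slot
        }
        value = default(TValue);
        return false;
    }
}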

That said, you should be able to write a task-specific hash table that outperforms the .NET Dictionary in some situations. But for a general-purpose hash table, I think you'll be hard pressed to do better than what the BCL provides.
