C++ map insertion and lookup performance and storage overhead

This article looks at insertion and lookup performance and storage overhead for C++ map; it may be a useful reference for anyone facing the same problem.

Problem Description

I would like to store a mapping from an integer key to a float value in memory.

I have roughly 130 million keys (and, correspondingly, 130 million values).

My focus is on lookup performance -- I have to do many, many millions of lookups.

The C++ STL library has a map class for associative arrays of this sort. I have several questions about map.

What is the storage overhead of map for a dataset of the size mentioned above? How does the storage overhead of map scale in general?

It looks like the underlying data structure for map is a red-black, balanced binary tree. It sounds like the real-world performance for this is O(log n) for insertion and retrieval.

It mentions O(1) for a hinted insertion. My input is pre-sorted, so I believe I should be able to provide a hint for insertions. How would I provide this hint, using the methods listed here?
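A minimal sketch of what hinted insertion can look like with pre-sorted keys (illustrative only; the idiom of passing end(), or the iterator returned by the previous insert, as the hint is an assumption of this sketch, not something stated in the question):

```cpp
#include <cstdio>
#include <map>
#include <utility>

int main() {
    std::map<int, float> m;

    // With keys arriving in ascending order, pass the iterator returned by the
    // previous insert (initially end()) as the hint; each insertion then lands
    // at the right edge of the tree in amortized constant time.
    std::map<int, float>::iterator hint = m.end();
    for (int key = 0; key < 1000; ++key) {
        hint = m.insert(hint, std::make_pair(key, key * 0.5f));
    }

    std::printf("size=%lu value[42]=%f\n",
                static_cast<unsigned long>(m.size()), m.find(42)->second);
    return 0;
}
```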



Is there an STL container that provides better lookup performance?

Are there other publicly available, open-source frameworks with an associative array class whose underlying data structure would perform better than STL map?

If writing my own container class would give better lookup performance, what data structures might I research?

I am using GCC 4 for this task, running under either Linux or Mac OS X.

I apologize in advance if these are dumb questions. Thank you for your advice.

Solution

Given what you've said, I'd think very hard about using a std::vector<pair<int, float> >, and using std::lower_bound, std::upper_bound, and/or std::equal_range to look up values.
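A minimal sketch of that approach, assuming the table can be built once in ascending key order and is only queried afterwards (the helper names here are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

// Lets std::lower_bound compare a stored (key, value) pair against a bare key.
struct KeyLess {
    bool operator()(const std::pair<int, float>& p, int key) const {
        return p.first < key;
    }
};

// Returns a pointer to the value for `key`, or NULL if the key is absent.
const float* lookup(const std::vector<std::pair<int, float> >& table, int key) {
    std::vector<std::pair<int, float> >::const_iterator it =
        std::lower_bound(table.begin(), table.end(), key, KeyLess());
    if (it != table.end() && it->first == key)
        return &it->second;
    return NULL;
}

int main() {
    std::vector<std::pair<int, float> > table;
    table.reserve(3);                          // reserve the full 130M in the real case
    table.push_back(std::make_pair(1, 0.5f));  // entries must be appended in key order
    table.push_back(std::make_pair(7, 1.25f));
    table.push_back(std::make_pair(42, 3.0f));

    const float* v = lookup(table, 7);
    std::printf("%s\n", v ? "found" : "missing");  // prints "found"
    return 0;
}
```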



While the exact overhead of std::map can (and does) vary, there's little or no room for question that it will normally consume extra memory and look values up more slowly than a binary search in a vector. As you've noted, it's normally (and almost unavoidably) implemented as some sort of balanced tree, which imposes overhead for the pointers and the balancing information, and typically means each node is allocated separately as well. Since your payload is quite small (typically 8 bytes), that extra data is likely to be at least as much as what you're actually storing (i.e. at least 100% overhead). Separate allocations also often mean poor locality of reference, which leads to poor cache usage.
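As a rough back-of-the-envelope illustration of that overhead (the per-node byte counts below are assumptions about a typical 64-bit red-black-tree node plus allocator bookkeeping, not measured figures):

```cpp
#include <cstdio>

int main() {
    // Assumed per-entry costs on a 64-bit build; real numbers vary by library.
    const double entries    = 130e6;      // ~130 million key/value pairs
    const double payload    = 8;          // int key + float value
    const double tree_extra = 3 * 8 + 8;  // parent/left/right pointers + color/padding
    const double heap_extra = 16;         // typical per-allocation bookkeeping/alignment
    const double mib        = 1024.0 * 1024.0;

    std::printf("sorted vector: ~%.0f MiB\n", entries * payload / mib);
    std::printf("std::map:      ~%.0f MiB\n",
                entries * (payload + tree_extra + heap_extra) / mib);
    return 0;
}
```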



Edit: Looking just at implementations of std::map, it's probably worth noting that most use a red-black tree. If you were going to use a std::map, an implementation that uses an AVL tree might suit your purposes better -- an AVL tree has slightly tighter constraints on balancing. This gives slightly faster lookup at the expense of slightly slower insertion and deletion (since it has to re-balance more often to maintain its stricter interpretation of "balanced"). As long as your data remains constant during use, however, a std::vector is still almost certainly better.



One other possibility worth noting: if your keys are at least fairly evenly distributed, you might want to try looking up with interpolation instead of bisection. That is, instead of always starting at the middle of the vector, you do a linear interpolation to guess the most likely starting point for the lookup. Of course, if your keys follow some known non-linear distribution, you can use a matching interpolation instead.
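A minimal sketch of interpolation lookup over the same sorted vector, assuming roughly uniformly distributed integer keys (the function name and fallback details are illustrative, not part of the original answer):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Interpolation search on a vector sorted by key: guess the probe position by
// linear interpolation between the endpoints, then narrow the range as binary
// search would. Returns a pointer to the value, or NULL if the key is absent.
const float* interp_lookup(const std::vector<std::pair<int, float> >& t, int key) {
    std::size_t lo = 0, hi = t.size();
    while (lo < hi) {
        const int lo_key = t[lo].first;
        const int hi_key = t[hi - 1].first;
        std::size_t mid;
        if (key <= lo_key || lo_key == hi_key) {
            mid = lo;                      // degenerate range: probe the low end
        } else if (key >= hi_key) {
            mid = hi - 1;                  // probe the high end
        } else {
            // Interpolate the likely position from the key's relative value.
            const double frac = static_cast<double>(key - lo_key) /
                                static_cast<double>(hi_key - lo_key);
            mid = lo + static_cast<std::size_t>(frac * static_cast<double>(hi - 1 - lo));
        }
        if (t[mid].first == key)
            return &t[mid].second;
        if (t[mid].first < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}
```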



Edit 2: Assuming the keys are reasonably evenly distributed, interpolation search has a complexity of O(log log N). For 130 million keys, that works out to around 4 probes to find an item. To do significantly better than that with (normal, non-perfect) hashing, you need a good algorithm, and you need to keep the load factor of the table around 75% or so -- i.e. you need to allow for something like 32 million extra (empty) slots in your table to improve the expected complexity from four probes to three. I may just be old-fashioned, but that strikes me as a lot of extra storage to spend on such a small speed improvement.



OTOH, it's true that this is nearly the ideal situation for perfect hashing -- the set is known ahead of time, and the keys are quite small (important, since hashing is normally linear in the key size). Even so, unless the keys are distributed quite unevenly, I wouldn't expect any huge improvement -- a perfect hash function is often (usually?) fairly complex.



This concludes the article on C++ map insertion and lookup performance and storage overhead. We hope the answer above is helpful, and thank you for supporting IT屋!
