A space efficient data structure to store and look-up through a large set of (uniformly distributed) Integers


Problem description




I'm required to hold, in memory, and look up through one million uniformly distributed integers. My workload is extremely look-up intensive.

My current implementation uses a HashSet (Java). I see good look-up performance, but the memory usage is not ideal (dozens of MB).

Can you think of a more memory-efficient data structure?

Edit: The solution will need to support a small number of additions to the data structure.

Background:
The Integers problem stated above is a simplification of the following problem:
I have a set of one million Strings (my "Dictionary"), and I want to tell whether the Dictionary contains a given string, or not.
The Dictionary is too large to fit in memory, so I'm willing to sacrifice a tiny bit of accuracy to reduce the memory footprint. I'll do that by switching to a Dictionary containing each String's hashCode value (an integer) instead of the actual characters. I'm assuming that the chance of a collision, per string, is only 1M/2^32.
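The hashCode-dictionary idea above can be sketched as follows; the class name `HashCodeSet` and its methods are illustrative, not part of the original question. It stores each String's hashCode in a sorted int[] (4 bytes per entry, so roughly 4 MB for a million strings) and answers membership queries by binary search, accepting the ~1M/2^32 per-query false-positive chance:

```java
import java.util.Arrays;
import java.util.Collection;

// Sketch of the hashCode-dictionary idea (class name is illustrative).
// Stores each String's hashCode in a sorted int[] -- 4 bytes per entry,
// ~4 MB for a million strings -- and answers membership by binary search.
// A hit may be a false positive with probability ~1M/2^32 per query.
public class HashCodeSet {
    private final int[] codes;

    public HashCodeSet(Collection<String> dictionary) {
        codes = new int[dictionary.size()];
        int i = 0;
        for (String s : dictionary) {
            codes[i++] = s.hashCode();
        }
        Arrays.sort(codes); // duplicates are harmless for membership tests
    }

    public boolean mightContain(String s) {
        return Arrays.binarySearch(codes, s.hashCode()) >= 0;
    }
}
```

Note that `mightContain` can return a false positive (two strings sharing a hashCode) but never a false negative, which matches the accuracy trade-off described above.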

Solution

While Jon Skeet's answer gives good savings for a small investment, I think you can do better. Since your numbers are fairly evenly distributed, you can use an interpolating search for faster lookups (roughly O(log log N) instead of O(log N)). For a million items, you can probably plan on around 4 comparisons instead of around 20.
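A minimal sketch of an interpolating search over a sorted int[], assuming roughly uniform values (the class and method names are hypothetical). Rather than always probing the midpoint, it estimates the key's position from its value relative to the current endpoints:

```java
// Minimal interpolation search over a sorted int[] of roughly uniform
// values (class and method names are illustrative). Rather than always
// probing the midpoint, it estimates the key's position from its value
// relative to the current endpoints, giving ~O(log log N) comparisons.
public class InterpolationSearch {
    public static int search(int[] a, int key) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (a[lo] == a[hi]) {
                return a[lo] == key ? lo : -1;
            }
            // long arithmetic avoids overflow in the position estimate
            int pos = lo + (int) ((((long) key - a[lo]) * (hi - lo)) / ((long) a[hi] - a[lo]));
            if (a[pos] == key) return pos;
            if (a[pos] < key) lo = pos + 1;
            else hi = pos - 1;
        }
        return -1;
    }
}
```

The range check in the loop condition also serves as an early exit: once the key falls outside [a[lo], a[hi]], it cannot be present.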

If you want to do just a little more work to cut the memory (roughly) in half again, you could build it as a two-level lookup table, basically a sort of simple version of a trie.

You'd break your (presumably) 32-bit integer into two 16-bit pieces. You'd use the first 16 bits as an index into the first level of the lookup table. At this level, you'd have 65536 pointers, one for each possible 16-bit value for that part of your integer. That would take you to the second level of the table. For this part, we'd do a binary or interpolation search between the chosen pointer, and the next one up -- i.e., all the values in the second level that had that same value in the first 16 bits.

When we look in the second table, however, we already know 16 bits of the value -- so instead of storing all 32 bits of the value, we only have to store the other 16 bits of the value.

That means instead of the second level occupying 4 megabytes, we've reduced it to 2 megabytes. Along with that we need the first level table, but it's only 65536x4=256K bytes.
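The two-level layout described above might look like this in Java (names are illustrative; the sketch assumes the values are sorted so that their unsigned high 16 bits are non-decreasing, which holds for non-negative values in ordinary sorted order). The first level is a 65537-entry offset array (~256 KB); the second level stores only the low 16 bits of each value as a short (2 MB for a million entries):

```java
// Illustrative sketch of the two-level table described above. Assumes the
// input is sorted so the unsigned high 16 bits are non-decreasing (true
// for non-negative values in ordinary sorted order).
public class TwoLevelIntSet {
    private final int[] start = new int[65536 + 1]; // bucket boundaries: ~256 KB
    private final short[] low;                      // low 16 bits only: 2 bytes/value

    public TwoLevelIntSet(int[] sortedValues) {
        low = new short[sortedValues.length];
        int bucket = 0;
        for (int i = 0; i < sortedValues.length; i++) {
            int hi = sortedValues[i] >>> 16;        // unsigned high half
            while (bucket <= hi) start[bucket++] = i;
            low[i] = (short) sortedValues[i];       // keep only the low 16 bits
        }
        while (bucket <= 65536) start[bucket++] = sortedValues.length;
    }

    public boolean contains(int value) {
        int target = value & 0xFFFF;
        // With ~15 values per bucket on average, a linear scan is fine here;
        // binary or interpolation search would also work on the same range.
        for (int i = start[value >>> 16]; i < start[(value >>> 16) + 1]; i++) {
            if ((low[i] & 0xFFFF) == target) return true;
        }
        return false;
    }
}
```

The per-bucket scan compares low halves unsigned (`& 0xFFFF`), since the shorts are stored as the raw low bits of the original values.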

This will almost certainly improve speed over a binary search of the entire data set. In the worst case (using a binary search for the second level) we could have as many as 17 comparisons (1 + log2 65536). The average will be better than that though -- since we have only a million items, there can only be an average of 1_000_000/65536 = ~15 items in each second-level "partition", giving approximately 1 + log2(16) = 5 comparisons. Using an interpolating search at the second level might reduce that a little further, but when you're only starting with 5 comparisons, you don't have much room left for really dramatic improvements. Given an average of only ~15 items at the second level, the type of search you use won't make much difference -- even a linear search is going to be pretty fast.

Of course, if you wanted to you could go a step further and use a 4-level table instead (one for each byte in the integer). It may be open to question, however, whether that would save you enough more to be worth the trouble. At least right off, my immediate guess is that you'd be doing a fair amount of extra work for fairly minimal savings (just storing the final bytes of the million integers obviously occupies 1 megabyte, and the three levels of table leading to that would clearly occupy a fair amount more), so you'd double the number of levels to save something like half a megabyte. If you're in a situation where saving just a little more would make a big difference, go for it -- but otherwise, I doubt whether the return will justify the extra investment.
