构建和查找整数范围集的数据结构 [英] Data structure to build and lookup set of integer ranges

查看:1027
本文介绍了构建和查找整数范围集的数据结构的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组 uint32 整数,集合中可能有数百万项。其中50-70%是连续的,但在输入流中,它们以不可预测的顺序出现。



我需要:


  1. 将此集合压缩到范围以实现空间效率表示。已经使用微不足道的算法实现了这一点,因为只有一次速度在这里并不重要。经过这个转换之后的数量通常在5000到000 000之间,当然很多都是单项。


  2. 测试一些整数的成员资格,不需要关于集合中特定范围的信息。这个必须非常快 - O(1)。正在考虑最小完美散列函数,但是它们的范围不能很好。 Bitsets 的空间效率很低。其他结构,如二进制树,具有O(log n)的复杂性,最糟糕的是它们实现了许多条件跳转和处理器无法很好的预测它们的表现。


是否有任何专门用于整数范围的数据结构或算法来解决此任务?

解决方案

关于第二个问题:



您可以查看 Bloom Filters 。 Bloom过滤器专门用于回答O(1)中的会员问题,尽管答案是 no 也许 (这不是一个明确的一个是/否:p)。



可能案例中,当然,你需要进一步的处理才能真正回答这个问题(除非你的情况下有一个概率性的答案),但即使如此,Bloom Filter也可以作为门将,直接拒绝大部分查询。



此外,您可能希望在不同结构中保留实际范围和简并范围(单个元素)。




  • 单个元素可能最好存储在散列表中

  • 实际范围可以存储在排序数组中



这减少了排序数组中存储的元素的数量,从而减少了执行的二进制搜索的复杂性。由于你声明许多范围是退化的,所以我认为你只有500-1000个范围(即一个数量级),而log(1000)〜10



因此,我建议以下步骤:




  • Bloom过滤器:如果否,停止

  • 排序实数范围数组:如果是,停止

  • 单个元素的哈希表



首先执行排序数组测试,因为从数字(数百个数字合并在几千个范围内),如果包含一个数字,则可能会在一个范围内,而不是单个):



最后一个注意事项:请注意O(1),虽然可能看起来很吸引人,但您不是在渐近的情况。只有5000-10000的范围是很少的,因为log(10000)是13。所以不要通过得到一个O(1)解决方案,以一个如此高的常数因子,它实际上比O慢(log N )解决方案:)


I have a set of uint32 integers, there may be millions of items in the set. 50-70% of them are consecutive, but in input stream they appear in unpredictable order.

I need to:

  1. Compress this set into ranges to achieve space efficient representation. Already implemented this using trivial algorithm, since ranges computed only once speed is not important here. After this transformation number of resulting ranges is typically within 5 000-10 000, many of them are single-item, of course.

  2. Test membership of some integer, information about specific range in the set is not required. This one must be very fast -- O(1). Was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space inefficient. Other structures, like binary trees, has complexity of O(log n), worst thing with them that implementation make many conditional jumps and processor can not predict them well giving poor performance.

Is there any data structure or algorithm specialized in integer ranges to solve this task?

解决方案

Regarding the second issue:

You could look-up on Bloom Filters. Bloom Filters are specifically designed to answer the membership question in O(1), though the response is either no or maybe (which is not as clear cut as a yes/no :p).

In the maybe case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom Filter may act as a gate keeper, and reject most of the queries outright.

Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.

  • single elements may be best stored in a hash-table
  • actual ranges can be stored in a sorted array

This diminishes the number of elements stored in the sorted array, and thus the complexity of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1000 ranges (ie, an order of magnitude less), and log(1000) ~ 10

I would therefore suggest the following steps:

  • Bloom Filter: if no, stop
  • Sorted Array of real ranges: if yes, stop
  • Hash Table of single elements

The Sorted Array test is performed first, because from the number you give (millions of number coalesced in a a few thousands of ranges) if a number is contained, chances are it'll be in a range rather than being single :)

One last note: beware of O(1), while it may seem appealing, you are not here in an asymptotic case. Barely 5000-10000 ranges is few, as log(10000) is something like 13. So don't pessimize your implementation by getting a O(1) solution with such a high constant factor that it actually runs slower than a O(log N) solution :)

这篇关于构建和查找整数范围集的数据结构的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆