用于构建和查找整数范围集的数据结构 [英] Data structure to build and lookup set of integer ranges

查看:34
本文介绍了用于构建和查找整数范围集的数据结构的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组 uint32 整数,其中可能有数百万个项目.其中 50-70% 是连续的,但在输入流中它们以不可预知的顺序出现.

I have a set of uint32 integers, there may be millions of items in the set. 50-70% of them are consecutive, but in input stream they appear in unpredictable order.

我需要:

  1. 将此集合压缩为范围以实现节省空间的表示.已经使用简单的算法实现了这一点,因为只计算一次的范围速度在这里并不重要.经过这种转换后,结果范围的数量通常在 5 000-10 000 之间,当然,其中许多是单项.

  1. Compress this set into ranges to achieve space efficient representation. Already implemented this using trivial algorithm, since ranges computed only once speed is not important here. After this transformation number of resulting ranges is typically within 5 000-10 000, many of them are single-item, of course.

测试某个整数的成员资格,不需要有关集合中特定范围的信息.这个必须非常快——O(1).正在考虑 最小完美散列函数,但它们不能很好地处理范围.Bitsets 空间效率非常低.其他结构,如二叉树,复杂度为 O(log n),最糟糕的是,实现会产生许多条件跳转,并且处理器无法很好地预测它们,从而导致性能不佳.

Test membership of some integer, information about specific range in the set is not required. This one must be very fast -- O(1). Was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space inefficient. Other structures, like binary trees, has complexity of O(log n), worst thing with them that implementation make many conditional jumps and processor can not predict them well giving poor performance.

有没有专门针对整数范围的数据结构或算法来解决这个任务?

Is there any data structure or algorithm specialized in integer ranges to solve this task?

推荐答案

关于第二个问题:

您可以在 Bloom Filters 上查找.Bloom Filters 专门设计用于回答 O(1) 中的成员资格问题,尽管响应是 nomaybe(不像是/否那样明确:p).

You could look-up on Bloom Filters. Bloom Filters are specifically designed to answer the membership question in O(1), though the response is either no or maybe (which is not as clear cut as a yes/no :p).

maybe 的情况下,当然,您需要进一步处理才能实际回答问题(除非概率答案在您的情况下就足够了),但即便如此,Bloom Filter 也可能充当门keeper,并直接拒绝大部分查询.

In the maybe case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom Filter may act as a gate keeper, and reject most of the queries outright.

此外,您可能希望在不同的结构中保留实际范围和退化范围(单个元素).

Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.

  • 单个元素最好存储在哈希表中
  • 实际范围可以存储在排序数组中

这减少了存储在排序数组中的元素数量,从而减少了在那里执行的二进制搜索的复杂性.既然你说许多范围是退化的,我认为你只有大约 500-1000 个范围(即少一个数量级),并且 log(1000) ~ 10

This diminishes the number of elements stored in the sorted array, and thus the complexity of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1000 ranges (ie, an order of magnitude less), and log(1000) ~ 10

因此,我建议采取以下步骤:

I would therefore suggest the following steps:

  • 布隆过滤器:如果没有,则停止
  • 实数范围的排序数组:如果是,则停止
  • 单个元素的哈希表

首先执行排序数组测试,因为从您给出的数字(数百万个数字合并在几千个范围内)如果包含一个数字,它很可能在一个范围内而不是单个 :)

The Sorted Array test is performed first, because from the number you give (millions of number coalesced in a a few thousands of ranges) if a number is contained, chances are it'll be in a range rather than being single :)

最后一点:当心 O(1),虽然它看起来很吸引人,但你不是在渐近情况下.几乎没有 5000-10000 范围,因为 log(10000) 大约是 13.所以不要通过获得具有如此高常数因子的 O(1) 解决方案来悲观你的实现,以至于它实际上运行得比 O(log N) 解决方案:)

One last note: beware of O(1), while it may seem appealing, you are not here in an asymptotic case. Barely 5000-10000 ranges is few, as log(10000) is something like 13. So don't pessimize your implementation by getting a O(1) solution with such a high constant factor that it actually runs slower than a O(log N) solution :)

这篇关于用于构建和查找整数范围集的数据结构的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆