数据结构可以有效地在查询范围内查找整数 [英] Data structure to find integers within a query range efficiently

查看:142
本文介绍了数据结构可以有效地在查询范围内查找整数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

已知范围中有一个任意数量的不同无符号整数值。

There is an arbitrary amount of distinct unsigned integer values within a known range.

整数值的数目为<<

The number of integer values is << the number of integers within the range.

我想构建一个允许以下运行时复杂性的数据结构:

I want to build a data structure which allows the following runtime complexities:


  1. 插入O(1)

  2. 插入完成后:

    • 删除在O(1)中

    • 获取O(k)中的查询范围内的所有值,其中k是结果值的数量(返回值不必排序)

内存复杂度不受限制。但是,天文数据量不足; - )

Memory complexity is not restricted. However, an astronomically large amount of memory is not available ;-)

这是一个例子:


  • range = [0,1023]

  • insert 42

  • insert 350

  • 插入729

  • insert 64

  • insert 1

  • insert 680

  • insert 258

  • 在[300; 800]中查找值;返回{350,729,680}

  • 删除350

  • 删除680

  • 在[35 ; 1000];删除{42,258,64,729,258}

  • 删除42

  • 删除258

  • find值在[0; 5];返回{1}

  • 删除1

  • range = [0, 1023]
  • insert 42
  • insert 350
  • insert 729
  • insert 64
  • insert 1
  • insert 680
  • insert 258
  • find values in [300;800] ; returns {350, 729, 680}
  • delete 350
  • delete 680
  • find values in [35;1000] ; returns {42, 258, 64, 729, 258}
  • delete 42
  • delete 258
  • find values in [0; 5] ; returns {1}
  • delete 1

这样的数据结构是否可行? (借助查表等)?

Is such a data structure even possible? (with the aid of look-up tables etc)?

我想到的一个近似将是:


  • 将插入的值分成桶。 0..31 =>桶0,32,83 =>桶1,64..95 =>桶2,96..127 =>桶3,...

  • Bin the inserted values into buckets. 0..31 => bucket 0, 32..63 => bucket 1, 64..95 => bucket 2, 96..127 => bucket 3, ...

插入:使用简单的移位算法查找bucket id,然后将其插入到每个数据块的数组

Insertion: find bucket id using simple shifting arithmetic, then insert it into an array per bucket

查找:查找开始的bucket id和终点使用移位算术。查看第一个和最后一个bucket中的所有值,并检查它们是否在范围内或超出范围。将所有中间存储桶中的所有值添加到搜索结果

Find: find bucket id of start and endpoint using shifting arithmetic. Look through all values in the first and last bucket and check if they are within the range or outside the range. Add all values in all intermediate buckets to the search result

删除:使用移位查找桶ID。交换值以桶中的最后一个值进行删除,然后减少此桶的计数。

Delete: find bucket id using shifting. Swap value to delete with last value in bucket, then decrement count for this bucket.

下降:如果有很多查询查询范围小于32的范围,每次都会搜索整个数据桶。

Downside: if there are many queries which query a range which has a span of less than 32 values, the whole bucket will be searched every time.

下行2:如果范围内有空桶他们也将在搜索阶段进行访问。

Downside 2: if there are empty buckets within the range, they will also be visited during the search phase.

推荐答案

理论上讲, van Emde Boas tree 是您最好的选择,O(log log M) - 时间操作,其中M是范围的大小。空间使用量相当大,尽管有更有效的变体。

Theoretically speaking, a van Emde Boas tree is your best bet, with O(log log M)-time operations where M is the size of the range. The space usage is quite large, though there are more efficient variants.

实际上,理论上的技术状态在论文一维范围报告,由Mortensen,Pagh和Patrascu。

Actually the theoretical state of the art is described in the paper On Range Reporting in One Dimension, by Mortensen, Pagh, and Patrascu.

我不知道现有的下限是否排除O(1),但是M不会大到足以使区分的重要性。而不是vEB结构,我只是使用一个k-ary trie,其中有两个像32或64的ka电源。

I'm not sure if the existing lower bounds rule out O(1), but M won't be large enough to make the distinction matter. Instead of the vEB structure, I would just use a k-ary trie with k a power of two like 32 or 64.

编辑:这里有一种方法来进行范围搜索一个trie。

here's one way to do range search with a trie.

我们假设每个数据是一个有点模式(足够简单,这就是CPU的想法)。每个子树由具有特定前缀的所有节点组成。例如,{0000,0011,0101,1001}由以下4-ary trie表示,其中 X 表示空指针。

Let's assume each datum is a bit pattern (easy enough; that's how the CPU think of it). Each subtree consists of all of the nodes with a certain prefix. For example, {0000, 0011, 0101, 1001} is represented by the following 4-ary trie, where X denotes a null pointer.

+---+---+---+---+
|00\|01\|10\|11X|
+--|+--|+--|+---+
   |   |   |
   |   |   +----------------------------+
+--+   |                                |
|      +------------+                   |
|                   |                   |
v                   v                   v
+---+---+---+---+   +---+---+---+---+   +---+---+---+---+
|00\|01X|10X|11\|   |00X|01\|10X|11X|   |00X|01\|10X|11X|
+--|+---+---+--|+   +---+--|+---+---+   +---+--|+---+---+
   |           |           |                   |
   v           v           v                   v
  0000        0011        0101                1001

几个优化很快变得明显。首先,如果所有的位模式都是相同的长度,那么我们不需要将它们存储在树叶上 - 它们可以从下降路径重建。所有我们需要的是位图,如果k是机器字中的位数,则很适合以前级别的指针。

A couple optimizations quickly become apparent. First, if all of the bit patterns are the same length, then we don't need to store them at the leaves—they can be reconstructed from the descent path. All we need is the bitmap, which if k is the number of bits in a machine word, fits nicely where the pointer from the previous level used to be.

+--------+--------+--------+--------+
|00(1001)|01(0100)|10(0100)|11(0000)|
+--------+--------+--------+--------+

为了搜索trie中的[0001,1000]范围,我们从根开始,确定哪个子树可能与该范围相交,递归他们。在这个例子中,根的相关子元素是00,01和10. 00的相关子元素是表示前缀0001,0010和0011的子树。

In order to search the trie for a range like [0001, 1000], we start at the root, determine which subtrees might intersect the range and recurse on them. In this example, the relevant children of the root are 00, 01, and 10. The relevant children of 00 are the subtrees representing the prefixes 0001, 0010, and 0011.

对于k固定,来自k-ary trie的报告是O(log M + s),其中M是位模式的数量,s是命中数。不要被愚弄 - 当k是中等的时候,每个节点占用几条缓存线,但是这个特技不是很高,所以缓存未命中的数量很小。

For k fixed, reporting from a k-ary trie is O(log M + s), where M is the number of bit patterns and s is the number of hits. Don't be fooled though—when k is medium, each node occupies a couple cache lines but the trie isn't very high, so the number of cache misses is pretty small.

这篇关于数据结构可以有效地在查询范围内查找整数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆