用于排序数组的紧凑数据结构 [英] Compact data structure for sorted array

查看：183 发布时间：2016/12/25 13:21:09 arrays data-structures compression

本文介绍了用于排序数组的紧凑数据结构的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个包含排序数字的表格：

  1 320102 
 2 5200100 
 3 92010023 
 4 112010202 
 5 332020201 
 6 332020411 
：
 5000000000 3833240522044511 
 5000000001 3833240522089999 
 5000000002 4000000000213312

给定记录号，我需要O（log n）时间中的值。记录号为64位长，并且没有缺少的记录号。值是64位长，它们被排序，值（n）值（n + 1）。

显而易见的解决方案是简单地做一个数组，并使用记录号作为索引。这将花费64位每个值。

但我想要一个更空间高效的方法。因为我们知道值总是增加，应该是可行的，但我不记得一个数据结构，让我这样做。

一个解决方案是使用deflate on

你知道一个数据结构，它会给我：

/ p>

O（log n）用于访问

64位/值

= Edit =

数字提前我们可以找到每个数字之间的差异。通过采取这些差异的第99百分位数，我们将获得相对适度的数字。取log2将给我们表示中等数字所需的位数 - 让我们调用那个modest-bits。

然后创建：

 记录的64位值0 
记录的64位值1024 
记录的64位值2048 
记录的64位值3072 
记录的64位值4096

所有记录的表：

  modest-bits差值记录0 
与上一个记录的modest-bits差异
 1022 *与上一个记录的modest-bits差异
 modest-bits差值记录1024

记录k * 1024的最小比特差将总是为0，因此我们可以使用信令。如果它是非零的，那么下面的64位将是一个指向简单数组的指针，用于接下来的1024个记录作为64位值。

值被选择为第99百分位数，这将最多发生1％的时间，从而浪费至多1％* n * modest-bits + 1％* n * 64-bit * 1024。
$ b p空间：O（modest-bits * n + 64-bit * n / 1024 + 1％* n * modest-bits + 1％* n * 64-bit * 1024） >

查找：O（1 + 1024）

（可能需要调整99％和1024）

= Edit2 =

基于上述想法，但浪费更少的空间。创建此：

 记录的64位值0 
记录的64位值1024 
 64位记录2048 
记录的64位值3072 
记录的64位值4096

对于所有无法由modest-bits表示的值，以树形式创建大值表：

  64位位置，64位值
 64位位置，64位值
 64位位置，64位值

然后为所有记录创建一个增量表，该表为每1024个记录重置：

  modest-bits差值记录0 
与上一个记录的modest-bits差值
 1022 *与上一个记录的modest-bits差值
 modest-bits差值记录1024

，但也会为大值表中的每个值重置

 
 
  space：O（modest-bits * n + 64-bit * n / 1024 + 1％* n * 2 * 64-bit）。
 
 
 查找需要搜索大值表，然后查找第1024个值，最后累加适度的位值。
 
 
 查找：O（log -value table）+ 1 + 1024）= O（log n）
 
 
 你能改善吗？或者以不同的方式做得更好？
解决方案
 OP建议将数字分割成块（只有一次）。但是这个过程可以继续。再次拆分每个块。再次...最后，我们可能会得到一个二叉树。
 
 
   
 
 
 根节点包含具有最小索引的数字的值。其右后裔存储表中的中间数字与具有最小索引的数字之间的差： d = A [N / 2] -A [0] -N / 2 。这对于其他权利后代（图上的红色节点）继续。叶节点包含来自之前数字的增量： d = A [i + 1] -A [i] -1 。
 
 
 因此，大多数值存储在trie中，是delta值。它们中的每个占用小于64位。并且为了紧凑，它们可以作为可变比特长度数字存储在比特流中。为了获得每个数字的长度并在O（log N）时间内导航到这个结构中，比特流还应该包含（一些）数字和（一些）子树的长度：
 
 每个节点包含其左子树的长度（以位为单位）（如果有）。
 
 每个右后代，除了叶节点，包含其值的长度（以位为单位）。叶节点的长度可以从从根到该节点的路径上的其他长度计算。
 
 每个右后代（图上的红色节点）包含对应值的差异和最近的红色节点在路径上。
 
 所有节点都以位流打包，从根节点开始，按顺序：左后裔总是跟随其祖先; 
 
 
 要访问给定索引的元素，使用索引的二进制表示遵循路径。 trie。在遍历此路径时，将红色节点的所有值添加在一起。 
 
 
 有几个选项可存储N / 2个值的长度：
 
 根据需要为每个长度分配尽可能多的位，以表示从最大长度到低于平均长度的某处（不包括一些非常短的离群值）。
 $ b $  
 由于长度可能不均匀分布，因此对值长度使用霍夫曼编码是合理的。
 > 
 
固定长度或霍夫曼编码对于每个线索深度应该不同。
 
 
  N / 4子树长度实际上是值长度，因为N / 4个最小子树包含单个值。
 
 
 其它N / 
 
 
 对于2 ³⁰全范围64位数，我们必须包装大约34位值，对于3/4节点，大约。 4位值长度，以及对于每第四个节点，10位子树长度。 
 
 
  
 $ b   $ p>  0 320102 
 1 5200100 
 2 92010023 
 3 112010202 
 4 332020201 
 5 332020411 
 6 3833240522044511 
 7 3833240522089999 
 8 4000000000213312 
  
为这些值命名：
 根d = 320102 vl = 19 tl = 84 + 8 + 105 + 4 + 5 = 206 
 + -l tl = 75 + 5 = 84 
 | + -l tl = 23 
 | | + -l 
 | | | + -r d = 4879997（vl = 23）
 | | + -r d = 91689919 vl = 27 
 | | + -r d = 20000178（v1 = 25）
 | + -r d = 331700095 vl = 29 tl = 8 
 | + -l 
 | | + -r d = 209（v1 = 8）
 | + -r d = 3833240190024308 vl = 52 
 | + -rd = 45487（vl = 16）
 + -rd = 3999999999893202 vl = 52 
  
值长度编码：
 位开始结束
根0 19 19 
深度1 0 52 52 
深度2 0 29 29 
深度3 5 27 52 
深度4 4 8 23 
  
子树长度每个需要8位。 
 
 
这里是编码流（为了可读性，二进制值仍然以十进制显示）：
 位值注释
 19 320102根值
 8 206根子树的长度根
 8 84 84子树长度
 4 15最小的左子树长度（基值为8）
 23 4879997索引1的值
 5 0索引2的值长度（基本值为27）
 27 91689919值对于索引2 
 25 20000178索引3的值
 29 331700095索引4的值
 4 0最小左子树长度（基本值8）
 8 209索引值5 
 5 25索引6的值长度（基本值为27）
 52 3833240190024308索引值6 
 16 45487索引值7 
 52 3999999999893202索引值8 
  
共有285位或5个64位字。我们还需要存储来自值长度编码表（350位）的位/起始值。为了存储635位，我们需要10个64位字，这意味着这样小的数字表不能被压缩。 
 
 
 要搜索索引7的值，请读取根值（320102），跳过206位，添加值对于索引4（331700095），跳过8位，为索引6添加值（3833240190024308），为索引7添加值（45487），并添加索引（7）。其结果是3 833 240 522 089 999，如预期。
 
I have a table with sorted numbers like:
1 320102
2 5200100
3 92010023
4 112010202
5 332020201
6 332020411
: 
5000000000 3833240522044511
5000000001 3833240522089999
5000000002 4000000000213312
Given the record number I need the value in O(log n) time. The record number is 64-bit long and there are no missing record numbers. The values are 64-bit long, they are sorted and value(n) < value(n+1).

The obvious solution is simply doing an array and use the records number as index. This will cost 64-bit per value.

But I would like a more space efficient way of doing that. Since we know the values are always increasing that should be doable, but I do not remember a data structure that lets me do that.

A solution would be to use deflate on the array, but that will not give me O(log n) for accessing an element - thus unacceptable.

Do you know of a data structure that will give me:


O(log n) for access
space requirement < 64-bit/value


= Edit =

Since we know all numbers in advance we could find the difference between each number. By taking the 99th percentile of these differences we will get a relatively modest number. Taking the log2 will give us the number of bits needed to represent modest number - let us call that modest-bits.

Then create this:
64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096
Then a delta table for all records:
modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024
modest-bits difference to record k*1024 will always be 0, so we can use that for signaling. If it is non-zero, then the following 64-bit will be a pointer to a simple array for the next 1024 records as 64-bit values.

As the modest value is chosen as the 99th percentile number, that will at most happen 1% of the time, thus wasting at most 1% * n * modest-bits + 1% * n * 64-bit * 1024.

space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * modest-bits + 1% * n * 64-bit * 1024)

lookup: O(1 + 1024)

(99% and 1024 may have to be adjusted)

= Edit2 =

Based on the idea above, but wasting less space. Create this:
64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096
And for all value that cannot be represented by modest-bits create big-value table as a tree:
64-bit position, 64-bit value
64-bit position, 64-bit value
64-bit position, 64-bit value
Then a delta table for all records, that is reset for every 1024 records:
modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024
but also reset for every value that is in the big-value table.

space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * 2 * 64-bit).

Lookup requires searching big-value table, then looking up the 1024'th value and finally summing up the modest-bits values.

lookup: O(log(big-value table) + 1 + 1024) = O(log n)

Can you improve this? Or do better in a different way?
 解决方案 
OP proposes splitting numbers into blocks (only once). But this process may be continued. Split every block once more. And again... Finally we might get a binary trie.



Root node contains value of the number with least index. Its right descendant stores difference between the middle number in the table and the number with least index: d = A[N/2] - A[0] - N/2. This is continued for other right descendants (red nodes on diagram). Leaf nodes contain deltas from preceding numbers: d = A[i+1] - A[i] - 1.

So most of the values, stored in trie, are delta values. Each of them occupies less than 64 bits. And for compactness they may be stored as variable-bit-length numbers in a bit stream. To get length of each number and to navigate in this structure in O(log N) time, bit stream should also contain lengths of (some) numbers and (some) subtrees:

Each node contains length (in bits) of its left sub-tree (if it has one).
Each right descendant (red nodes on diagram), except leaf nodes, contains length (in bits) of its value. Leaf node's length may be calculated from other lengths on the path from root to this node.
Each right descendant (red nodes on diagram) contains difference of correspondent value and the value of nearest "red" node up the path.
All nodes are packed in bit stream, starting from root node, in-order: left descendant always follows its ancestor; right descendant follows sub-tree, rooted by left descendant.
To access element given its index, use index's binary representation to follow path in the trie. While traversing this path, add together all values of "red" nodes. Stop when no more non-zero bits are left in the index.

There are several options to store N/2 value lengths:

Allocate as many bits for each length as needed to represent all values from the largest length to somewhere below mean length (excluding some very short outliers).
Also exclude some long outliers (keep them in a separate map).
Since lengths may be not evenly distributed, it's reasonable to use Huffman encoding for value lengths.
Either fixed length or Huffman encodings should be different for each trie depth.

N/4 subtree lengths are, in fact, value lengths, because N/4 smallest subtrees contain a single value.

Other N/4 subtree lengths may be stored in words of fixed (predefined) length, so that for large subtrees we know only approximate (rounded up) lengths.

For 2³⁰ full-range 64-bit numbers we have to pack approximately 34-bit values, for 3/4 nodes, approx. 4-bit value lengths, and for every fourth node, 10-bit subtree lengths. Which saves 34% space.



Example values:
0 320102
1 5200100
2 92010023
3 112010202
4 332020201
5 332020411
6 3833240522044511
7 3833240522089999
8 4000000000213312
Trie for these values:
root               d=320102           vl=19    tl=84+8+105+4+5=206
   +-l                                         tl=75+4+5=84
   | +-l                                       tl=23
   | | +-l
   | | | +-r       d=4879997          (vl=23)
   | | +-r         d=91689919         vl=27
   | |   +-r       d=20000178         (vl=25)
   | +-r           d=331700095        vl=29    tl=8
   |   +-l
   |   | +-r       d=209              (vl=8)
   |   +-r         d=3833240190024308 vl=52
   |     +-r       d=45487            (vl=16)
   +-r             d=3999999999893202 vl=52
Value length encoding:
           bits start end
Root       0    19    19
depth 1    0    52    52
depth 2    0    29    29
depth 3    5    27    52
depth 4    4    8     23
Sub-tree lengths need 8 bits each.

Here is encoded stream (binary values still shown in decimal for readability):
bits value                      comment
19   320102                     root value
8    206                        left subtree length of the root
8    84                         left subtree length
4    15                         smallest left subtree length (with base value 8)
23   4879997                    value for index 1
5    0                          value length for index 2 (with base value 27)
27   91689919                   value for index 2
25   20000178                   value for index 3
29   331700095                  value for index 4
4    0                          smallest left subtree length (with base value 8)
8    209                        value for index 5
5    25                         value length for index 6 (with base value 27)
52   3833240190024308           value for index 6
16   45487                      value for index 7
52   3999999999893202           value for index 8
Altogether 285 bits or 5 64-bit words. We also need to store bits/start values from value length encoding table (350 bits). To store 635 bits we need 10 64-bit words, which means such a small number table cannot be compressed. For larger number tables, size of value length encoding table is negligible.

To search a value for index 7, read root value (320102), skip 206 bits, add value for index 4 (331700095), skip 8 bits, add value for index 6 (3833240190024308), add value for index 7 (45487), and add index (7). The result is 3 833 240 522 089 999, as expected.

                        这篇关于用于排序数组的紧凑数据结构的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

用于排序数组的紧凑数据结构 [英] Compact data structure for sorted array

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

用于排序数组的紧凑数据结构 [英] Compact data structure for sorted array

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭