为有序数组紧凑的数据结构 [英] Compact data structure for sorted array

查看：160 发布时间：2016/6/1 21:53:26 arrays data-structures compression

本文介绍了为有序数组紧凑的数据结构的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有排序号码表所示：

1 320102
2 5200100
3 92010023
4 112010202
5 332020201
6 332020411
: 
5000000000 3833240522044511
5000000001 3833240522089999
5000000002 4000000000213312

由于记录号，我需要在O值（log n）的时间。记录编号为64位长，并且没有丢失的记录编号。的值是64位长时，被分类和（n）值和下;值（N + 1）。

Given the record number I need the value in O(log n) time. The record number is 64-bit long and there are no missing record numbers. The values are 64-bit long, they are sorted and value(n) < value(n+1).

显而易见的解决方案是简单地做一个数组，并使用记录数为指标。这将花费每个值64位。

The obvious solution is simply doing an array and use the records number as index. This will cost 64-bit per value.

不过，我想这样做的更多的空间有效的方式。因为我们知道的值总是越来越多，应该是可行的，但我不记得有一个数据结构，让我做到这一点。

But I would like a more space efficient way of doing that. Since we know the values are always increasing that should be doable, but I do not remember a data structure that lets me do that.

一个解决方案是使用紧缩阵列上，但是这不会给我O（log n）的访问一个元素 - 因而不能接受

A solution would be to use deflate on the array, but that will not give me O(log n) for accessing an element - thus unacceptable.

你知道一个数据结构，这将使我：

Do you know of a data structure that will give me:

O（log n）的访问

的空间需求和LT; 64位/值

=编辑=

既然我们知道事先我们可以找到每个数字之间的差的所有数字。通过采取这些差异的第99百分位数，我们会得到一个相对温和一些。服用LOG2会给我们重新present温和一些需要的位数 - 我们称之为适度位

Since we know all numbers in advance we could find the difference between each number. By taking the 99th percentile of these differences we will get a relatively modest number. Taking the log2 will give us the number of bits needed to represent modest number - let us call that modest-bits.

然后创建一个这样的：

64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096

然后所有记录的增量表：

Then a delta table for all records:

modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024

谦虚位差值来记录K * 1024将始终为0，所以我们可以利用它来进行信号。如果它是不为零，那么下面的64位将是一个指向一个简单阵列为下一个1024记录为64位值。

modest-bits difference to record k*1024 will always be 0, so we can use that for signaling. If it is non-zero, then the following 64-bit will be a pointer to a simple array for the next 1024 records as 64-bit values.

由于适度的值被选择为第99百分位数号码，将在大部分发生的时间的1％，因此浪费至多1％* N *适度位+ 1％* N * 64位* 1024。

As the modest value is chosen as the 99th percentile number, that will at most happen 1% of the time, thus wasting at most 1% * n * modest-bits + 1% * n * 64-bit * 1024.

空间：O（谦虚位* N + 64位* N / 1024 + 1％* N *谦虚位+ 1％* N * 64位* 1024）

space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * modest-bits + 1% * n * 64-bit * 1024)

查询：O（1 + 1024）

lookup: O(1 + 1024)

（99％和1024可能必须调整）

(99% and 1024 may have to be adjusted)

= EDIT2 =

= Edit2 =

根据上面的想法，但浪费更少的空间。创建此：

Based on the idea above, but wasting less space. Create this:

64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096

和为$通过适度位psented p $创造大价值表不能重新所有值作为树：

And for all value that cannot be represented by modest-bits create big-value table as a tree:

64-bit position, 64-bit value
64-bit position, 64-bit value
64-bit position, 64-bit value

然后所有记录的增量表，即重置为每1024记录：

Then a delta table for all records, that is reset for every 1024 records:

modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024

也重置每一个是在大值表值。

but also reset for every value that is in the big-value table.

空间：O（谦虚位* N + 64位* N / 1024 + 1％* N * 2 * 64位）

space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * 2 * 64-bit).

查找需要寻找大值表，然后找了1024'th值，最后总结谦虚位值。

Lookup requires searching big-value table, then looking up the 1024'th value and finally summing up the modest-bits values.

查询：O（日志（大值表）+ 1 + 1024）= O（log n）的

lookup: O(log(big-value table) + 1 + 1024) = O(log n)

你能提高吗？或以不同的方式做的更好？

Can you improve this? Or do better in a different way?

推荐答案

OP提出分割成数块（只有一次）。但这个过程可以继续进行。拆分每个块一次。又一次......最后我们可能得到一个二进制特里。

OP proposes splitting numbers into blocks (only once). But this process may be continued. Split every block once more. And again... Finally we might get a binary trie.

根节点包含至少索引的数目的值。在表中间的数字，并以最小的索引数之间的权后裔店的区别： D = A [N / 2] - A [0] - N / 2 。这是持续了其他权利的后代（图上的红色节点）。叶节点包含preceding号三角洲： D = A [I + 1] - A [I] - 1

Root node contains value of the number with least index. Its right descendant stores difference between the middle number in the table and the number with least index: d = A[N/2] - A[0] - N/2. This is continued for other right descendants (red nodes on diagram). Leaf nodes contain deltas from preceding numbers: d = A[i+1] - A[i] - 1.

所以大部分的价值，存放在线索的，是增量值。他们每个人都占据不到64位。和紧凑它们可以被存储为在比特流中的可变位长度的数字。要获得每个号码的长度，在此结构为O导航（日志N）的时间，比特流还应包含（部分）号和（一些）子树的长度：

So most of the values, stored in trie, are delta values. Each of them occupies less than 64 bits. And for compactness they may be stored as variable-bit-length numbers in a bit stream. To get length of each number and to navigate in this structure in O(log N) time, bit stream should also contain lengths of (some) numbers and (some) subtrees:

每个节点包含它的左子树的长度（位）（如果有的话）。

每个正确的后裔（图上红色的节点），除叶节点，包含其值的长度（位）。叶节点的长度可从根到此节点的路径上其他长度来计算。

每个正确的后裔（图上红色的节点）中包含对应的值和最近的红起来的节点的路径的价值差异。

所有节点都装在比特流，从根节点开始，按顺序：左后裔始终遵循它的祖先;正确的后代如下子树，由左后代扎根。

要访问的元素给定了索引，使用索引的二进制重新presentation跟随路径的特里。虽然穿越这条道路，新增的红节点的所有值加在一起。当没有更多的非零位留在索引停止

To access element given its index, use index's binary representation to follow path in the trie. While traversing this path, add together all values of "red" nodes. Stop when no more non-zero bits are left in the index.

有几个选项来存储N / 2值长度：

There are several options to store N/2 value lengths:

根据需要重新present从最大长度低于某个地方的所有值平均长度（不包括一些很短的异常值）分配的比特数为各自的长度。

另外，还要排除一些长期离群值（让他们在一个单独的地图）。

由于长度可能不是均匀分布的，这是合理的使用霍夫曼编码值的长度。

无论是固定长度或霍夫曼编码应为每个线索的深度不同。

Either fixed length or Huffman encodings should be different for each trie depth.

N / 4的子树的长度，事实上，值的长度，因为N / 4，最小的子树包含单个值

N/4 subtree lengths are, in fact, value lengths, because N/4 smallest subtrees contain a single value.

其他N / 4子树的长度可以存储在固定的（predefined）长度的话，那么，对于大子树，我们只知道大概（四舍五入）的长度。

Other N/4 subtree lengths may be stored in words of fixed (predefined) length, so that for large subtrees we know only approximate (rounded up) lengths.

有关2 ^{我们要收拾大约34位值，为3/4节点，约30}全系列64位数字。 4位值的长度，并为每一个第四节点，10位的子树的长度。节省34％的空间。

For 2³⁰ full-range 64-bit numbers we have to pack approximately 34-bit values, for 3/4 nodes, approx. 4-bit value lengths, and for every fourth node, 10-bit subtree lengths. Which saves 34% space.

示例值：

0 320102
1 5200100
2 92010023
3 112010202
4 332020201
5 332020411
6 3833240522044511
7 3833240522089999
8 4000000000213312

特里这些值：

root               d=320102           vl=19    tl=84+8+105+4+5=206
   +-l                                         tl=75+4+5=84
   | +-l                                       tl=23
   | | +-l
   | | | +-r       d=4879997          (vl=23)
   | | +-r         d=91689919         vl=27
   | |   +-r       d=20000178         (vl=25)
   | +-r           d=331700095        vl=29    tl=8
   |   +-l
   |   | +-r       d=209              (vl=8)
   |   +-r         d=3833240190024308 vl=52
   |     +-r       d=45487            (vl=16)
   +-r             d=3999999999893202 vl=52

值长度编码：

           bits start end
Root       0    19    19
depth 1    0    52    52
depth 2    0    29    29
depth 3    5    27    52
depth 4    4    8     23

子树的长度需要在每次8位。

Sub-tree lengths need 8 bits each.

下面是连接codeD流（二进制值仍然十进制的可读性所示）：

Here is encoded stream (binary values still shown in decimal for readability):

bits value                      comment
19   320102                     root value
8    206                        left subtree length of the root
8    84                         left subtree length
4    15                         smallest left subtree length (with base value 8)
23   4879997                    value for index 1
5    0                          value length for index 2 (with base value 27)
27   91689919                   value for index 2
25   20000178                   value for index 3
29   331700095                  value for index 4
4    0                          smallest left subtree length (with base value 8)
8    209                        value for index 5
5    25                         value length for index 6 (with base value 27)
52   3833240190024308           value for index 6
16   45487                      value for index 7
52   3999999999893202           value for index 8

总共285位或5的64位字。我们还需要存储比特/开始从值长度编码表（350位）值。存储635比特，我们需要10 64位字，这意味着这样的小号码表不能COM pressed。对于较大数量的表，价值长度编码表的大小是可以忽略的。

Altogether 285 bits or 5 64-bit words. We also need to store bits/start values from value length encoding table (350 bits). To store 635 bits we need 10 64-bit words, which means such a small number table cannot be compressed. For larger number tables, size of value length encoding table is negligible.

要搜索指数7，读根值（320102）的值，则跳过206位，对指数4（331700095）增加价值，跳过8位，对于指数6（3833240190024308）增加值，指数7增加值（ 45487），并添加索引（7）。其结果是3 833 240 522 089 999，符合市场预期。

To search a value for index 7, read root value (320102), skip 206 bits, add value for index 4 (331700095), skip 8 bits, add value for index 6 (3833240190024308), add value for index 7 (45487), and add index (7). The result is 3 833 240 522 089 999, as expected.

这篇关于为有序数组紧凑的数据结构的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

为有序数组紧凑的数据结构 [英] Compact data structure for sorted array

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

为有序数组紧凑的数据结构 [英] Compact data structure for sorted array

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭