什么是一些位数组的替代方法? [英] What are some alternatives to a bit array?

查看:156
本文介绍了什么是一些位数组的替代方法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个信息检索应用程序,可以创建10位数位的位数组。数组中的set位的数量变化很大,从全部清除到全部。目前,我正在使用一个直线位数组( java.util.BitSet ),所以我的每个位数组都需要几兆字节。

I have an information retrieval application that creates bit arrays on the order of 10s of million bits. The number of "set" bits in the array varies widely, from all clear to all set. Currently, I'm using a straight-forward bit array (java.util.BitSet), so each of my bit arrays takes several megabytes.

我的计划是查看第一个 N 位的基数,然后决定使用剩余的数据结构。显然,一些数据结构对于非常稀疏的位数组是更好的,而当大部分位被设置时,其他数据结构更好(当大多数位被设置时,我可以使用否定来将其当作稀疏集合)。

My plan is to look at the cardinality of the first N bits, then make a decision about what data structure to use for the remainder. Clearly some data structures are better for very sparse bit arrays, and others when roughly half the bits are set (when most bits are set, I can use negation to treat it as a sparse set of zeroes).


  • 什么结构可能在每个极端都有好处?

  • 中间有什么吗?

以下是一些约束或提示: / p>

Here are a few constraints or hints:


  1. 这些位只设置一次,按索引顺序。

  2. 我需要100%的准确性,所以像Bloom过滤器这样的东西还不够好。

  3. 在构建集合之后,我需要能够有效地遍历set位。

  4. 这些位是随机分布的,所以运行长度–编码算法不太可能比一个简单的位索引列表好多了。

  5. 我试图优化内存利用率,但速度仍然承载着一些重量。

  1. The bits are set only once, and in index order.
  2. I need 100% accuracy, so something like a Bloom filter isn't good enough.
  3. After the set is built, I need to be able to efficiently iterate over the "set" bits.
  4. The bits are randomly distributed, so run-length–encoding algorithms aren't likely to be much better than a simple list of bit indexes.
  5. I'm trying to optimize memory utilization, but speed still carries some weight.

开源Java实现的东西是有帮助的,但不是绝对必要的。我对基础知识更感兴趣。

Something with an open source Java implementation is helpful, but not strictly necessary. I'm more interested in the fundamentals.

推荐答案

除非数据是真正随机的 具有对称的1/0分布,那么这简单地成为无损数据压缩问题,非常类似于用于黑白和(二进制)传真图像的CCITT Group 3压缩。 CCITT Group 3使用霍夫曼编码方案。在传真的情况下,他们使用固定的霍夫曼码集合,但对于给定的数据集,您可以为每个数据集生成一组特定的代码,以提高实现的压缩比。只要您只需要依次访问这些位,就像您所暗示的那样,这将是一个非常有效的方法。随机访问会产生一些额外的挑战,但是您可能会生成一个二叉搜索树索引到数组中的各个偏移点,这样可以让您接近所需的位置,然后从那里进入。

Unless the data is truly random and has a symmetric 1/0 distribution, then this simply becomes a lossless data compression problem and is very analogous to CCITT Group 3 compression used for black and white (i.e.: Binary) FAX images. CCITT Group 3 uses a Huffman Coding scheme. In the case of FAX they are using a fixed set of Huffman codes, but for a given data set, you can generate a specific set of codes for each data set to improve the compression ratio achieved. As long as you only need to access the bits sequentially, as you implied, this will be a pretty efficient approach. Random access would create some additional challenges, but you could probably generate a binary search tree index to various offset points in the array that would allow you to get close to the desired location and then walk in from there.

注意 :即使数据是随机的,只要1/0分布不均匀,霍夫曼方案仍然可行。也就是说,分布越均匀越好压缩比。

Note: The Huffman scheme still works well even if the data is random, as long as the 1/0 distribution is not perfectly even. That is, the less even the distribution, the better the compression ratio.

最后,如果这些位是真正随机的均匀分布,那么,先生。 Claude Shannon ,您将无法使用任何方案压缩任何重要数量。

Finally, if the bits are truly random with an even distribution, then, well, according to Mr. Claude Shannon, you are not going to be able to compress it any significant amount using any scheme.

这篇关于什么是一些位数组的替代方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆