Compression algorithm for sorted integers

Problem description

I have a large sequence of random integers sorted from lowest to highest. The numbers start at 1 bit and end near 45 bits. At the beginning of the list the numbers are very close to each other: 4, 20, 23, 40, 66. As the numbers get higher, the distance between them tends to grow as well (the distances themselves are random). There are no duplicates.

I'm using bit packing to save some space, but the file can still get really huge.

I would like to know what kind of compression algorithm can be used in this situation, or any other technique that saves as much space as possible.

Thank you.

Recommended answer

You can compress optimally if you know the true distribution of the data. If you can provide a probability distribution for each integer, you can use arithmetic coding or another entropy coding technique to compress to the theoretical minimum size.
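To make "theoretical minimum size" concrete: for a known distribution, the bound an entropy coder approaches is the Shannon entropy, measured in bits per symbol. The sketch below is only an illustration; the helper name entropy_bits_per_symbol is made up here, and it measures the empirical distribution of a sample rather than a true, known distribution.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(symbols):
    """Shannon entropy of the empirical distribution of `symbols`, in bits.

    Hypothetical helper for illustration: an ideal entropy coder that knows
    this distribution needs about this many bits per symbol on average.
    """
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two equally likely symbols cost 1 bit each; a constant stream costs nothing.
balanced = entropy_bits_per_symbol([1, 1, 2, 2])
constant = entropy_bits_per_symbol([7, 7, 7])
```

Multiplying the result by the number of symbols gives a rough size estimate to compare against the current bit-packed file.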

The trick is in predicting accurately.

First, you should probably compress the distances between the numbers, because that allows you to make statistical statements. If you were to compress the numbers directly, you'd have a hard time modelling them because each one occurs only once.
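Turning the sorted list into distances is a small, reversible transform. A minimal sketch, assuming the first value is stored as its distance from 0 (the names to_gaps and from_gaps are made up for this example):

```python
from itertools import accumulate

def to_gaps(sorted_values):
    """Replace each value by its distance from the previous one.

    Assumption: the first value is stored as its distance from 0.
    """
    gaps = []
    previous = 0
    for value in sorted_values:
        gaps.append(value - previous)
        previous = value
    return gaps

def from_gaps(gaps):
    """Inverse transform: running sums restore the original values."""
    return list(accumulate(gaps))

values = [4, 20, 23, 40, 66]    # the example numbers from the question
gaps = to_gaps(values)          # [4, 16, 3, 17, 26]
restored = from_gaps(gaps)      # same list as `values`
```

Because there are no duplicates, every distance after the first is at least 1, so one could also store gap - 1 to shift the values slightly lower.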

Next, you could try to build a very simple model to predict the next distance. Keep a histogram of all previously seen distances and calculate the probabilities from the frequencies.
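One way to judge such a histogram model before writing a real arithmetic coder is to add up the ideal cost, -log2(p), of every distance under the probabilities the model would have assigned at that point. The sketch below does only that; it is not an encoder. The escape mechanism for distances never seen before, and the 45-bit raw fallback, are assumptions made for this example.

```python
import math
from collections import Counter

def estimated_bits(gaps, escape_bits=45):
    """Estimate the output size of an ideal entropy coder driven by an
    adaptive histogram of previously seen gaps.

    Assumption (not from the answer): one count is always reserved for an
    "escape" event, after which a new gap is stored raw in `escape_bits` bits.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for gap in gaps:
        denominator = seen + 1              # +1 is the reserved escape count
        if counts[gap] > 0:
            total_bits += -math.log2(counts[gap] / denominator)
        else:
            total_bits += -math.log2(1 / denominator) + escape_bits
        counts[gap] += 1                    # learn only after "coding" the gap
        seen += 1
    return total_bits

repetitive = estimated_bits([3, 3, 3, 3])   # one escape, then cheap repeats
distinct = estimated_bits([1, 2, 3, 4])     # every gap is new and pays the escape
```

A decoder can maintain the same histogram in the same order, which is why the model is updated only after each gap has been accounted for.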

You probably need to account for values not yet seen (you clearly can't assign them probability 0, because an entropy coder cannot encode an event with zero probability), but you can use heuristics for that, like encoding the next distance bit by bit and predicting each bit individually. You will pay almost nothing for the high-order bits, because they are almost always 0 and entropy coding optimizes them away.
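The bit-by-bit heuristic can be sketched the same way, by estimating the cost rather than producing output. The version below assumes the simplest possible context, one adaptive 0/1 counter per bit position with both counts starting at 1 so that neither outcome ever has probability 0. The function name and the fixed width of 45 bits are choices made for this example.

```python
import math

def bitwise_cost(gaps, width=45):
    """Ideal cost in bits of coding each gap bit by bit with one adaptive
    binary model per bit position.

    Assumptions: every gap fits in `width` bits, and counts start at 1/1.
    """
    # counts[i] = [times bit i was 0, times bit i was 1]
    counts = [[1, 1] for _ in range(width)]
    total_bits = 0.0
    for gap in gaps:
        for position in range(width - 1, -1, -1):   # high-order bits first
            bit = (gap >> position) & 1
            zeros, ones = counts[position]
            probability = counts[position][bit] / (zeros + ones)
            total_bits += -math.log2(probability)
            counts[position][bit] += 1
    return total_bits

# High-order bits that are always 0 quickly become almost free.
cost = bitwise_cost([5] * 100)
```

A stronger model could also condition each bit on the bits already coded for the same gap; that is a refinement beyond this sketch.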

All of this is much simpler if you know the distribution. Example: if you are compressing a list of all prime numbers, you know the theoretical distribution of distances because there are formulae for that. So you already have a perfect model.
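As a small illustration of the prime example: by the prime number theorem, the average distance between primes near n is approximately ln(n). The sketch below checks that on a modest range. It covers only the average gap, not the full distribution of distances, and the range limits are arbitrary choices for this example.

```python
import math

def primes_below(limit):
    """Sieve of Eratosthenes: all primes smaller than `limit`."""
    is_prime = [True] * limit
    is_prime[0:2] = [False, False]
    for n in range(2, math.isqrt(limit - 1) + 1):
        if is_prime[n]:
            for multiple in range(n * n, limit, n):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]

# Compare the observed mean gap in [50000, 100000) with ln of the midpoint.
window = [p for p in primes_below(100_000) if p >= 50_000]
observed_mean_gap = (window[-1] - window[0]) / (len(window) - 1)
predicted_mean_gap = math.log(75_000)
```

When a formula like this predicts the data well, it can replace a learned histogram as the source of probabilities.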
