How to efficiently predict if data is compressible


Problem description



I want to write a storage backend to store larger chunks of data. The data can be anything, but it is mainly binary files (images, pdfs, jar files) or text files (xml, jsp, js, html, java...). I found most of the data is already compressed. If everything is compressed, about 15% disk space can be saved.

I am looking for the most efficient algorithm that can predict with high probability that a chunk of data (let's say 128 KB) can be compressed or not (lossless compression), without having to look at all the data if possible.

The compression algorithm will be either LZF, Deflate, or something similar (maybe Google Snappy). So predicting if data is compressible should be much faster than compressing the data itself, and use less memory.

Algorithms I already know about:

  • Try to compress a subset of the data, let's say 128 bytes (this is a bit slow; this and the next check are sketched after the list)

  • Calculate the sum of 128 bytes, and if it's within a certain range then it's likely not compressible (within 10% of 128 * 127) (this is fast, and relatively good, but I'm looking for something more reliable, because the algorithm really only looks at the topmost bits for each byte)

  • Look at the file headers (relatively reliable, but feels like cheating)
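For reference, a rough Java sketch of the first two checks above; the method names, sample sizes, and 10% thresholds are illustrative assumptions, not taken from the question:

```java
import java.util.zip.Deflater;

// Sketch of the first two checks from the list above. Method names, sample
// sizes, and the thresholds are assumptions, not from the original question.
public class CompressibilityChecks {

    // Check 1: deflate a small sample and compare input and output sizes.
    static boolean sampleCompresses(byte[] data, int sampleLen) {
        int len = Math.min(sampleLen, data.length);
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(data, 0, len);
        deflater.finish();
        byte[] out = new byte[len + 64];         // enough even for incompressible input
        int compressedLen = deflater.deflate(out);
        deflater.end();
        return compressedLen < len * 0.9;        // assumed threshold: at least 10% savings
    }

    // Check 2: sum of the first 128 bytes. Random-looking (incompressible) data
    // has an expected sum near len * 127.5; a sum far from that suggests a
    // skewed byte distribution, i.e. likely compressible data.
    static boolean sumLooksRandom(byte[] data) {
        int len = Math.min(128, data.length);
        int sum = 0;
        for (int i = 0; i < len; i++) {
            sum += data[i] & 0xFF;               // bytes as unsigned values
        }
        int expected = len * 255 / 2;
        return Math.abs(sum - expected) < expected / 10;  // "within 10%", as in the question
    }
}
```

The Deflater call is the slower of the two; the byte-sum check is nearly free but, as noted, effectively sees only the overall magnitude of the bytes.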

I guess the general idea is that I need an algorithm that can quickly calculate if the probability of each bit in a list of bytes is roughly 0.5.
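Taking that idea literally, one cheap test is to count the set bits in a sample and check whether they make up about half of all bits. A sketch (the 0.48-0.52 band is an arbitrary choice of mine, and the test is necessary but not sufficient, since structured data can still have balanced bits):

```java
// Sketch: fraction of 1-bits in a sample. Values close to 0.5 are consistent
// with random-looking, likely incompressible data; the band is arbitrary.
static boolean bitsLookRandom(byte[] sample) {
    long ones = 0;
    for (byte b : sample) {
        ones += Integer.bitCount(b & 0xFF);
    }
    double fractionOfOnes = (double) ones / (8.0 * sample.length);
    return fractionOfOnes > 0.48 && fractionOfOnes < 0.52;
}
```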

Update

I have implemented 'ASCII checking', 'entropy calculation', and 'simplified compression', and all give good results. I want to refine the algorithms, and now my idea is to not only predict if data can be compressed, but also how much it can be compressed. Possibly using a combination of algorithms. Now if I could only accept multiple answers... I will accept the answer that gave the best results.
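For the 'entropy calculation' part, a minimal sketch of what such a check might look like (my own variant, not necessarily what was implemented): build a byte histogram and compute the Shannon entropy in bits per byte. Dividing the result by 8 also gives a rough estimate of how far an order-0 coder could shrink the data, which speaks to the 'how much' question:

```java
// Sketch: order-0 Shannon entropy of a sample, in bits per byte (0..8).
// entropy / 8 is a rough estimate of the achievable compressed/original ratio.
static double entropyBitsPerByte(byte[] sample) {
    int[] counts = new int[256];
    for (byte b : sample) {
        counts[b & 0xFF]++;
    }
    double entropy = 0.0;
    for (int count : counts) {
        if (count == 0) {
            continue;
        }
        double p = (double) count / sample.length;
        entropy -= p * (Math.log(p) / Math.log(2));
    }
    return entropy;   // ~8 for already-compressed or encrypted data, much lower for text
}
```

Real compressors also exploit repetition between bytes, so this estimate is pessimistic for text and code, but it separates already-compressed chunks (entropy near 8) from compressible ones fairly reliably.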

Additional answers (new ideas) are still welcome! If possible, with source code or links :-)

Solution

From my experience almost all of the formats that can effectively be compressed are non-binary. So checking if about 70-80% of the characters are within the [0-127] range should do the trick.
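A sketch of that check (the 75% cut-off is one point inside the suggested 70-80% band, and the sample size is an assumption):

```java
// Sketch: fraction of bytes in the ASCII range 0-127 over a small sample.
// If roughly three quarters or more are ASCII, assume the chunk is compressible.
static boolean looksLikeText(byte[] data, int sampleLen) {
    int len = Math.min(sampleLen, data.length);
    int ascii = 0;
    for (int i = 0; i < len; i++) {
        if ((data[i] & 0xFF) < 128) {
            ascii++;
        }
    }
    return ascii >= len * 0.75;   // assumed cut-off within the 70-80% range
}
```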

If you want to do it "properly" (even though I really can't see a reason to do that), you either have to run (parts of) your compression algorithm on the data or calculate the entropy, as tskuzzy already proposed.

