What is the least possible value of the data compression ratio for any real dataset?


Problem description


I am writing a ZLIB-like API for an embedded hardware compressor that uses the deflate algorithm to compress a given input stream.

Before going further, I would like to explain the data compression ratio. It is defined as the ratio of the uncompressed size to the compressed size.

The compression ratio is usually greater than one, which means the compressed data is usually smaller than the uncompressed data; that is the whole point of compression. But this is not always the case. For example, compressing pseudo-random data generated on a Linux machine with the ZLIB library gives a compression ratio of roughly 0.996, which means 9960 bytes are "compressed" into 10000 bytes.
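The effect described above can be checked with a short Python sketch. It assumes the standard-library `zlib` module stands in for the ZLIB library and `os.urandom` for the pseudo-random source; `wbits=-15` requests a raw deflate stream so that the zlib header and trailer are not counted. The helper name `raw_deflate_ratio` is made up for this illustration, and the exact ratio depends on the input size and the zlib version.

```python
import os
import zlib


def raw_deflate_ratio(data: bytes, level: int = 6) -> float:
    """Return uncompressed size / compressed size for a raw deflate stream."""
    # wbits=-15 selects raw deflate: no zlib header, no Adler-32 trailer.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    compressed = compressor.compress(data) + compressor.flush()
    return len(data) / len(compressed)


if __name__ == "__main__":
    random_data = os.urandom(10_000)   # incompressible input
    text_data = b"deflate " * 1_250    # highly repetitive input, same size
    print(f"random: {raw_deflate_ratio(random_data):.4f}")
    print(f"text:   {raw_deflate_ratio(text_data):.4f}")
```

On random input the ratio should land slightly below 1, while the repetitive input compresses by a large factor.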

I know ZLIB handles this situation by using a type 0 (stored) block, where it simply returns the original uncompressed data with a header of roughly 5 bytes, so the overhead is only 5 bytes per data block of up to 64 KB. This is an intelligent solution to the problem, but for some reason I cannot use it in my API. I have to provide extra safety space in advance to handle this situation.
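The stored-block overhead described above can be turned into a worst-case size estimate. The sketch below is in Python for brevity and assumes an encoder that falls back to stored blocks of up to 65535 bytes, each costing 5 bytes (3 header bits padded to one byte, plus a 16-bit length and its 16-bit complement). The function names are hypothetical, and an encoder that cannot emit stored blocks would need its own bound.

```python
MAX_STORED_BLOCK = 65535   # largest payload a single stored block can carry
STORED_OVERHEAD = 5        # 1 byte (header bits + padding) + LEN + NLEN


def stored_bound(n: int) -> int:
    """Worst-case raw deflate size for n input bytes using only stored blocks."""
    blocks = max(1, -(-n // MAX_STORED_BLOCK))  # ceiling division, at least one block
    return n + blocks * STORED_OVERHEAD


def worst_ratio(n: int) -> float:
    """Uncompressed size divided by the stored-block bound."""
    return n / stored_bound(n)


if __name__ == "__main__":
    for size in (1024, 10_000, 65_535, 65_536):
        print(size, stored_bound(size), round(worst_ratio(size), 5))
```

Under these assumptions a 1024-byte input grows to at most 1029 bytes, a ratio of about 0.995.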

Now, if I knew the lowest known data compression ratio, it would be easy to calculate the extra space I have to provide. Otherwise, to be safe, I have to provide more extra space than needed, which can be critical in an embedded system.

When calculating the data compression ratio, I am not concerned with the header, the footer, extremely small datasets, or system-specific details, since I handle those separately. What I am particularly interested in is whether any real dataset exists that is at least 1K in size and gives a compression ratio below 0.99 with the deflate algorithm. In that case the calculation would be:
Compression ratio = uncompressed size / (compressed size using deflate, excluding header, footer, and system-specific overhead)

Please provide feedback. Any help would be appreciated. It would be great if a reference to such a dataset could be provided.

EDIT:
@MSalters' comment indicates that the hardware compressor is not following the deflate specification properly, and that this may be a bug in the microcode.

Solution

The deflate algorithm takes a similar approach to the ZLIB algorithm. Each block starts with a 3-bit header: a 1-bit final-block flag (BFINAL) followed by a 2-bit block type (BTYPE). A block type of 00 means the following block is stored length-prefixed but otherwise uncompressed.

This means the worst case is a one-byte input that blows up to 6 bytes (3 bits of header, 5 bits of padding to the byte boundary, 32 bits of length, 8 bits of data), so the worst ratio is 1/6 ≈ 0.17.
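The 6-byte figure can be checked by assembling the stored block by hand. The following Python sketch assumes the stored-block layout from the deflate specification (RFC 1951) and uses the standard `zlib` module only to confirm that the result decodes; `stored_block` is a name invented for this example.

```python
import struct
import zlib


def stored_block(payload: bytes) -> bytes:
    """Wrap payload in a single, final stored (BTYPE=00) deflate block."""
    if len(payload) > 0xFFFF:
        raise ValueError("a stored block holds at most 65535 bytes")
    # First byte: BFINAL=1 in bit 0, BTYPE=00 in bits 1-2, then 5 padding bits.
    header = b"\x01"
    # LEN and its one's complement NLEN, both 16-bit little-endian.
    length = len(payload)
    lengths = struct.pack("<HH", length, length ^ 0xFFFF)
    return header + lengths + payload


if __name__ == "__main__":
    block = stored_block(b"A")
    print(len(block), block.hex())
    # wbits=-15 tells zlib to expect a raw deflate stream.
    print(zlib.decompress(block, wbits=-15))
```

For a one-byte payload the function returns exactly 6 bytes, which zlib accepts as a valid raw deflate stream.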

This of course assumes an encoder that falls back to a stored block when compression does not help. A suboptimal encoder might transmit a Huffman table for that one byte.
