Why is the gzip-compressed buffer larger than the uncompressed buffer?


Question

I'm trying to write a compression utility class.
During testing, however, I found that the result is larger than the original buffer.
Is my code correct?

Here is the code:

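import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Apache Commons Lang helpers (the package name assumes Commons Lang 3)
import org.apache.commons.lang3.ArrayUtils;
import org.apache.commons.lang3.Validate;

/**
 * This class provides compression utilities.
 * <p>
 * Support:
 * <li>GZIP
 * <li>Deflate
 */
public class CompressUtils {
    public static final int DEFAULT_BUFFER_SIZE = 4096; // Compress/decompress buffer is 4K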

    /**
     * GZIP Compress
     * 
     * @param data The data to be compressed
     * @return The compressed data
     * @throws IOException
     */
    public static byte[] gzipCompress(byte[] data) throws IOException {
        Validate.isTrue(ArrayUtils.isNotEmpty(data));

        ByteArrayInputStream bis = new ByteArrayInputStream(data);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        try {
            gzipCompress(bis, bos);
            bos.flush();
            return bos.toByteArray();
        } finally {
            bis.close();
            bos.close();
        }
    }

    /**
     * GZIP Decompress
     * 
     * @param data The data to be decompressed
     * @return The decompressed data
     * @throws IOException
     */
    public static byte[] gzipDecompress(byte[] data) throws IOException {
        Validate.isTrue(ArrayUtils.isNotEmpty(data));

        ByteArrayInputStream bis = new ByteArrayInputStream(data);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();

        try {
            gzipDecompress(bis, bos);
            bos.flush();
            return bos.toByteArray();
        } finally {
            bis.close();
            bos.close();
        }
    }

    /**
     * GZIP Compress
     * 
     * @param is The input stream to be compressed
     * @param os The compressed result
     * @throws IOException
     */
    public static void gzipCompress(InputStream is, OutputStream os) throws IOException {
        GZIPOutputStream gos = null;

        byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];
        int count = 0;

        try {
            gos = new GZIPOutputStream(os);
            while ((count = is.read(buffer)) != -1) {
                gos.write(buffer, 0, count);
            }
            gos.finish();
            gos.flush();
        } finally {
            if (gos != null) {
                gos.close();
            }
        }
    }

    /**
     * GZIP Decompress
     * 
     * @param is The input stream to be decompressed
     * @param os The decompressed result
     * @throws IOException
     */
    public static void gzipDecompress(InputStream is, OutputStream os) throws IOException {
        GZIPInputStream gis = null;

        int count = 0;
        byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];

        try {
            gis = new GZIPInputStream(is);
            // Read from the GZIP stream, not the raw input, so the data is actually decompressed
            while ((count = gis.read(buffer)) != -1) {
                os.write(buffer, 0, count);
            }
        } finally {
            if (gis != null) {
                gis.close();
            }
        }
    }
}

Here is the test code:

public class CompressUtilsTest {
    private Random random = new Random();

    @Test
    public void gzipTest() throws IOException {
        byte[] buffer = new byte[1023];
        random.nextBytes(buffer);
        System.out.println("Original: " + Hex.encodeHexString(buffer));

        byte[] result = CompressUtils.gzipCompress(buffer);
        System.out.println("Compressed: " + Hex.encodeHexString(result));

        byte[] decompressed = CompressUtils.gzipDecompress(result);
        System.out.println("DeCompressed: " + Hex.encodeHexString(decompressed));

        Assert.assertArrayEquals(buffer, decompressed);
    }
}

The result: the original is 1023 bytes long, and the compressed output is 1036 bytes long.

What is going on?

Recommended answer

In your test, you initialize the buffer with random bytes.

GZIP compression (the DEFLATE algorithm) consists of two parts:

  1. LZ77 compression
  2. Encoding using a Huffman code

The former relies heavily on repeated sequences in the input. Basically, it says something like: "The next 10 characters are the same as the 10 characters starting at index X". In your case there are (most likely) no such repeated sequences, so the first algorithm achieves no compression.
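To see the effect of repeated sequences, here is a minimal sketch that compresses two 1023-byte buffers with GZIP: one filled with pseudo-random bytes (as in the question's test) and one built by repeating a short phrase. The class `GzipSizeDemo` and its helper methods are hypothetical names introduced for illustration; they are not part of the question's `CompressUtils`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the question's CompressUtils.
class GzipSizeDemo {

    // Compress a byte array with GZIP and return the compressed bytes.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gos = new GZIPOutputStream(bos)) {
            gos.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The GZIP stream is closed (and therefore finished) at this point.
        return bos.toByteArray();
    }

    // 1023 pseudo-random bytes; the fixed seed is only there for repeatability.
    static byte[] randomInput() {
        byte[] data = new byte[1023];
        new Random(42).nextBytes(data);
        return data;
    }

    // 1023 bytes built by repeating one short phrase over and over.
    static byte[] repetitiveInput() {
        byte[] phrase = "hello gzip ".getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[1023];
        for (int i = 0; i < data.length; i++) {
            data[i] = phrase[i % phrase.length];
        }
        return data;
    }

    public static void main(String[] args) {
        System.out.println("random:     1023 -> " + gzip(randomInput()).length);
        System.out.println("repetitive: 1023 -> " + gzip(repetitiveInput()).length);
    }
}
```

The random buffer should come out slightly larger than its input, while the repetitive buffer should shrink to a small fraction of it, because almost all of it can be expressed as back-references.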

The Huffman encoding, on the other hand, still applies, but it only saves space when some byte values occur noticeably more often than others, which is hardly the case for random bytes. In total, the GZIP overhead (for example, the header and trailer, and storing the Huffman code that was used) outweighs whatever is gained by compressing the input.
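The two stages can be looked at separately with `java.util.zip.Deflater`, which offers a `HUFFMAN_ONLY` strategy that skips the search for repeated sequences. The sketch below measures the raw DEFLATE output (no GZIP header or trailer) for a random buffer and a repetitive text buffer. The class name `DeflateStrategyDemo` and its helpers are assumptions made for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical demo class for comparing DEFLATE strategies.
class DeflateStrategyDemo {

    // Size in bytes of the raw DEFLATE stream produced with the given strategy.
    static int deflatedSize(byte[] data, int strategy) {
        // nowrap = true: emit raw DEFLATE data without any zlib/GZIP wrapper
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        deflater.setStrategy(strategy);
        deflater.setInput(data);
        deflater.finish();
        byte[] out = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out);
        }
        deflater.end();
        return total;
    }

    // 1023 pseudo-random bytes; the fixed seed is only there for repeatability.
    static byte[] randomInput() {
        byte[] data = new byte[1023];
        new Random(42).nextBytes(data);
        return data;
    }

    // 1023 bytes of text built by repeating one short phrase.
    static byte[] textInput() {
        byte[] phrase = "hello gzip ".getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[1023];
        for (int i = 0; i < data.length; i++) {
            data[i] = phrase[i % phrase.length];
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] random = randomInput();
        byte[] text = textInput();
        System.out.println("random, Huffman only: " + deflatedSize(random, Deflater.HUFFMAN_ONLY));
        System.out.println("text,   Huffman only: " + deflatedSize(text, Deflater.HUFFMAN_ONLY));
        System.out.println("text,   default:      " + deflatedSize(text, Deflater.DEFAULT_STRATEGY));
    }
}
```

On the text buffer, Huffman coding alone already helps because only a handful of distinct characters occur, and the default strategy does much better still by also using back-references. On the random buffer, Huffman coding has nothing to exploit, so the output is no smaller than the input.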

If you test your algorithm with real files, you will get more meaningful results.

The best results are usually obtained when compressing structured files such as XML.
