Is there a way to check if a buffer is in Brotli compressed format?


Question

I'm an intern doing research into whether using Brotli compression in a piece of software provides a performance boost over the current release, which uses GZip.

My task is to change anything using GZip to use Brotli compression instead. One function I need to replace tests whether a buffer contains data that was compressed using GZip. It does this by checking the two-byte stream identifier at the beginning of the buffer:

bool isGzipped() const
{
    // Gzip file signature (0x1f8b)
    return
        (_bufferEnd >= _bufferStart + 2) &&
        (static_cast<unsigned char>(_bufferStart[0]) == 0x1f) &&
        (static_cast<unsigned char>(_bufferStart[1]) == 0x8b);
}

I want to create a similar function, bool isBrotliEncoded(). Is there a similar quick check that can be done on Brotli-encoded buffers? I've had a look at the byte values of some of the compressed files that brotli produces, but I can't find a rule that holds for all of them. Some start with 0x5B, some with 0x1B, compressing an empty file results in 0x06, and files that have been compressed multiple times start with a range of different values. The end of each file is also inconsistent.

The only way I know of to test if it is in the correct format is to attempt decompression and wait for an error, which defeats the purpose of doing this test.

So my question is: Does anyone know how to check if a buffer has been compressed with Brotli without attempting decompression and waiting for failure?

Answer

Unfortunately, the raw brotli format is not well suited to such detection, even when simply trying to decompress and waiting for an error.

I ran a trial of one million brotli decompressions of random data. About 5% of them checked out as good brotli streams. So you've already got a problem right there. 3.5% of the million are a single byte, since there are nine one-byte values that are each a valid brotli stream. The mean length of the random valid streams was almost a megabyte.

For those in which an error was detected (about 95% of the million cases), 3.5% went more than a megabyte before the error was detected. 1.4% went more than ten megabytes. The mean number of random bytes before finding an error was 309 KB. Another problem.

In short, the probability of a false positive is relatively high, and the number of bytes to process to find a negative can be quite large.

If you are writing this software, then you should put your own header before the brotli data to aid in detection. Or you can use the brotli framing format that I developed at their request, which has a unique four-byte header before the brotli compressed stream. That would reduce the probability of a false positive dramatically.
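As a rough sketch of the "put your own header before the brotli data" option, the check can keep the same shape as isGzipped(). The magic bytes below are an arbitrary application-defined value chosen for illustration; they are an assumption, not the signature of the brotli framing format mentioned above, and the names kMagic, addHeader and hasBrotliHeader are likewise hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical application-defined magic ("BRO1"); NOT part of any brotli spec.
constexpr std::array<std::uint8_t, 4> kMagic = {0x42, 0x52, 0x4F, 0x31};

// Prepend the magic to a payload that has already been brotli-compressed.
std::vector<std::uint8_t> addHeader(const std::vector<std::uint8_t>& compressed)
{
    std::vector<std::uint8_t> out(kMagic.begin(), kMagic.end());
    out.insert(out.end(), compressed.begin(), compressed.end());
    return out;
}

// Same idea as isGzipped(): a length check followed by a byte comparison.
// The brotli stream itself would start at bufferStart + kMagic.size().
bool hasBrotliHeader(const char* bufferStart, const char* bufferEnd)
{
    return (static_cast<std::size_t>(bufferEnd - bufferStart) >= kMagic.size()) &&
           (std::memcmp(bufferStart, kMagic.data(), kMagic.size()) == 0);
}
```

With a four-byte prefix, uniformly random data matches with probability 1 in 2^32, compared with the roughly 5% false-positive rate measured above for raw streams. The decompressor has to skip the prefix before handing the rest of the buffer to brotli.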
