How to force zlib to decompress more than X bytes?


Problem Description

I have a file that consists of compressed content plus a 32-byte header. The header contains info such as a timestamp, the compressed size, and the uncompressed size.

The file itself is about 490 MB, and the header indicates the uncompressed size is close to 2.7 GB (this is clearly incorrect, as it also claims the compressed size is 752 MB).

I've stripped the header and generated the compressed payload and can uncompress it with zlib.

The problem is that it only decompresses 19 KB, which is far smaller than 490 MB (the bare minimum it should be; I'm expecting around 700 MB uncompressed).

My code is below:

import zlib

def consume(inputFile):
    # Read the entire stripped payload into memory.
    content = inputFile.read()
    print "Attempting to process " + str(len(content)) + " bytes..."
    outfile = open('output.xml', 'w')
    # Decompress the whole buffer in a single call.
    decompressed = zlib.decompress(content)
    print "Attempting to write " + str(len(decompressed)) + " bytes..."
    outfile.write(decompressed)
    outfile.close()

infile = open('payload', 'rb')

consume(infile)

infile.close()

When run, the program outputs:

Attempting to process 489987232 bytes...
Attempting to write 18602 bytes...

I've tried using zlib.decompressobj(), though this generates an "incorrect header" error. zlib.decompress() works fine and produces the decompressed XML that I expect... just far too little of it.
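
One hedged diagnostic, assuming the stripped payload is saved as 'payload' as in the code above: zlib.decompressobj() exposes an unused_data attribute holding whatever input bytes follow the end of the zlib stream, so it can show how much of the 490 MB the stream actually accounts for.

import zlib

with open('payload', 'rb') as f:
    data = f.read()

d = zlib.decompressobj()
out = d.decompress(data)

# unused_data holds everything past the end of the zlib stream;
# a large value here means most of the file isn't part of this stream.
print "Decompressed " + str(len(out)) + " bytes"
print "Trailing bytes after end of stream: " + str(len(d.unused_data))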

Any pointers or suggestions are greatly appreciated!

Solution

You clearly have a corrupted file.

You won't be able to force zlib to ignore the corruption—and, if you did, you'd most likely get either 700MB of garbage, or some random amount of garbage, or… well, it depends on what the corruption is and where. But the chances that you could get anything useful are pretty slim.
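
If the goal is just to keep whatever decompresses cleanly before the corruption, a minimal sketch along these lines would salvage the valid prefix; the 'payload' filename is from the question, and the chunk size is arbitrary:

import zlib

def salvage(data, chunk_size=64 * 1024):
    # Feed the stream in chunks and stop at the first zlib error,
    # keeping everything decoded up to that point.
    d = zlib.decompressobj()
    pieces = []
    for i in xrange(0, len(data), chunk_size):
        try:
            pieces.append(d.decompress(data[i:i + chunk_size]))
        except zlib.error:
            break
    return ''.join(pieces)

with open('payload', 'rb') as f:
    recovered = salvage(f.read())
print "Recovered " + str(len(recovered)) + " bytes"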

zlib's blocks aren't randomly accessible, delimited, or even byte-aligned; it's very hard to tell when you've reached the next block unless you were able to process the previous one.
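
You can see the lack of byte alignment directly: in a zlib-wrapped stream, the first DEFLATE block header is just three bits (BFINAL plus a two-bit BTYPE) starting at the low bit of the byte right after the 2-byte zlib header, and everything that follows is an unaligned bit stream. A quick illustrative peek:

import zlib

compressed = zlib.compress("hello world, hello world, hello world")

# DEFLATE packs bits LSB-first; byte 2 is the first byte after the
# 2-byte zlib wrapper (zlib.compress never uses a preset dictionary).
first = ord(compressed[2])
bfinal = first & 1        # bit 0: last-block flag
btype = (first >> 1) & 3  # bits 1-2: 0=stored, 1=fixed, 2=dynamic Huffman
print "BFINAL=%d BTYPE=%d" % (bfinal, btype)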

Plus, the 32 KB LZ77 window carries over from block to block, so even if you could skip to the next block, its back-references would point into data you never recovered, and you'd be decompressing garbage unless you get very, very lucky and nothing refers to the broken part. Even worse, any block can define new Huffman trees (or even switch to a stored or fixed-code block); if you misread that block header, you're decompressing garbage even if you do get very lucky. And it's not just a matter of "skip this string because I don't recognize it": if you can't decode a code, you don't even know how many bits long it is, so you can't skip it. Which brings us back to the first point: you can't even skip a single string, much less a whole block.

To understand this better, see RFC 1951, which describes the DEFLATE format zlib uses internally. Try manually working through a few trivial examples (just a couple of strings in the first block, a couple of new ones in the second block) to see how easy it is to corrupt them in a way that's hard to undo (unless you know exactly how they were corrupted). It's not impossible (after all, cracking encrypted messages isn't impossible), but I don't believe it could be fully automated, and it's not something you're likely to do for fun.
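
A throwaway experiment in the same spirit, showing how one flipped byte ruins everything after it (the exact zlib error message varies by version):

import zlib

original = "the quick brown fox jumps over the lazy dog " * 200
compressed = zlib.compress(original)

# Flip every bit of one byte in the middle of the stream.
mid = len(compressed) // 2
corrupted = (compressed[:mid]
             + chr(ord(compressed[mid]) ^ 0xFF)
             + compressed[mid + 1:])

try:
    zlib.decompress(corrupted)
except zlib.error as e:
    # Typically an "invalid distance"/"invalid code" error or a failed
    # Adler-32 check, depending on where the flip lands.
    print "Decompression failed:", e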

If you've got critical data (and can't just re-download it, roll back to the previous version, restore from backup, etc.), some data recovery services claim to be able to recover corrupted zlib/gz/zip files. I'm guessing this costs an arm and a leg, but it may be the right answer for the right data.

And of course I could be wrong about this not being automatable. There are a bunch of zip recovery tools out there. As far as I know, all they can do with broken zlib streams is skip that file and recover the other files… but maybe some of them have some tricks that work in some cases with broken streams.
