Uncompressing a ZIP file in memory in Java

Question

I'm downloading zipped files containing XMLs, and I'd like to avoid writing the zip files to disk before manipulating them because of latency requirements. However, java.util.zip doesn't suffice for me. There's no way to say "here's a byte array of a zip file, use it" without turning it into a stream, and ZipInputStream is not reliable, since it scans for entry headers (see discussion below EDIT for reasons why that is not reliable).

I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through the ZipInputStream, and I need to find a solution that will work with any valid ZIP files, as the penalty for a failure once I go into production will be high.

Assuming ZipInputStream won't work, what can I do to solve this problem in cases where there are no entry headers? I'm using Wikipedia's definition, which includes a comment on how to correctly uncompress zip files (quoted below), as the standard.

EDIT

The Apache Commons Zip library has a good write-up on some of the problems that stream-based reading (both their solution and Java's) has. I'll further add, from Wikipedia and personal experience, that the size and CRC fields in entry headers may not be filled in (I have files with -1 in these fields). Thanks to centic for providing this link.

Also, let me quote Wikipedia on the subject:


Tools that correctly read zip archives must scan for the signatures of the various fields, the zip central directory. They must not scan for entries because only the directory specifies where a file chunk starts. Scanning could lead to false positives, as the format doesn't forbid other data to be between chunks, or uncompressed stream containing such signatures.

Note that ZipInputStream scans for entries, not the central directory, which is the problem with it.

FINAL EDIT

If anyone is interested, this script can be used to produce a valid ZIP file that cannot be read by ZipInputStream from an existing ZIP file. So, as a final edit to this closed question, I needed a library that can read files such as the ones produced by this script.

Recommended answer

EDIT: Another suggestion...

Looking at ZipFile from the Apache Commons implementation, it looks like it wouldn't be too hard to effectively fork that for your project. Create a wrapper around your byte array which has all the pieces of the RandomAccessFile API which are required (I don't think there are very many). You've already indicated that you prefer the interface to ZipFile, so why not go with that?
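
As a rough illustration of the wrapper idea, here is a minimal sketch of a read-only, in-memory stand-in for the handful of RandomAccessFile operations a zip reader is likely to need. The class name ByteArrayRandomAccess and the chosen method set are assumptions made for this example; they are not taken from Apache Commons, and a real fork would have to match whatever calls its ZipFile actually makes.

```java
import java.io.EOFException;
import java.io.IOException;

/**
 * Hypothetical read-only replacement for the parts of RandomAccessFile
 * that a zip reader typically uses: length, seek, position and reads.
 */
final class ByteArrayRandomAccess {
    private final byte[] data;
    private int position;

    ByteArrayRandomAccess(byte[] data) {
        this.data = data;
    }

    long length() {
        return data.length;
    }

    long getFilePointer() {
        return position;
    }

    void seek(long pos) throws IOException {
        if (pos < 0 || pos > data.length) {
            throw new IOException("seek out of range: " + pos);
        }
        position = (int) pos;
    }

    /** Returns the next byte as 0..255, or -1 at the end of the array. */
    int read() {
        return position < data.length ? data[position++] & 0xFF : -1;
    }

    /** Copies up to len bytes into buf; returns the count, or -1 at the end. */
    int read(byte[] buf, int off, int len) {
        if (len == 0) {
            return 0;
        }
        int remaining = data.length - position;
        if (remaining <= 0) {
            return -1;
        }
        int n = Math.min(len, remaining);
        System.arraycopy(data, position, buf, off, n);
        position += n;
        return n;
    }

    /** Fills buf completely or fails, like RandomAccessFile.readFully. */
    void readFully(byte[] buf) throws EOFException {
        if (data.length - position < buf.length) {
            throw new EOFException();
        }
        System.arraycopy(data, position, buf, 0, buf.length);
        position += buf.length;
    }
}
```

Because nothing touches the disk, seeking back to the central directory at the end of the archive costs nothing. Newer releases of Apache Commons Compress may also let ZipFile read from an in-memory channel directly, which would make a hand-written wrapper unnecessary; check the documentation of the version in use.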

We don't know enough about your project to know whether this opens up any legal questions - and even if you gave details, I doubt that anyone here would be able to give good legal advice - but I suspect it wouldn't take more than an hour or two to get this solution up and working, and I suspect you'd have reasonable confidence in it.

EDIT: This may be a slightly more productive answer...

If you're worried about the entries not being contiguous, but don't want to handle all the compression side yourself, you might consider an option where you effectively rewrite the data. Create a new ByteArrayOutputStream, and read the central directory at the end. For each entry in the central directory, write out an entry (header + data) to the output stream in a format that you believe ZipInputStream will be happy with. Then write a new central directory - if you want your replacement to be valid you may need to do this from scratch, but if you're using code which you know won't actually read the central directory, you could just provide the original one, ignoring the fact that it might not then be valid. So long as it starts with the right signature, that's probably good enough :)
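
The first step of that rewrite, reading the central directory at the end, could be sketched roughly as follows. This is an illustration under stated assumptions rather than a complete reader: the class and method names are made up for the example, entry names are assumed to be UTF-8, and Zip64 archives, multi-disk archives and archive comments that happen to contain the end-of-central-directory signature are not handled. The field offsets follow the published ZIP format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class CentralDirectoryWalker {
    static final int EOCD_SIG = 0x06054b50; // end of central directory record
    static final int CEN_SIG = 0x02014b50;  // central directory file header

    /** Returns "name@localHeaderOffset" for every central directory entry. */
    static List<String> listEntries(byte[] zip) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        // The record is 22 bytes plus an optional comment, so search backwards.
        int eocd = zip.length - 22;
        while (eocd >= 0 && buf.getInt(eocd) != EOCD_SIG) {
            eocd--;
        }
        if (eocd < 0) {
            throw new IllegalArgumentException("no end of central directory record");
        }
        int count = buf.getShort(eocd + 10) & 0xFFFF; // total number of entries
        int pos = buf.getInt(eocd + 16);              // start of central directory
        List<String> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            if (buf.getInt(pos) != CEN_SIG) {
                throw new IllegalArgumentException("bad central header at " + pos);
            }
            int nameLen = buf.getShort(pos + 28) & 0xFFFF;
            int extraLen = buf.getShort(pos + 30) & 0xFFFF;
            int commentLen = buf.getShort(pos + 32) & 0xFFFF;
            long localOffset = buf.getInt(pos + 42) & 0xFFFFFFFFL;
            // Assumption: names are UTF-8 (the format also allows CP437).
            String name = new String(zip, pos + 46, nameLen, StandardCharsets.UTF_8);
            result.add(name + "@" + localOffset);
            pos += 46 + nameLen + extraLen + commentLen;
        }
        return result;
    }

    /** Builds a small two-entry archive in memory so the sketch can be run. */
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("a.xml"));
            zos.write("<a>first</a>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.xml"));
            zos.write("<b>second</b>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        for (String entry : listEntries(sampleZip())) {
            System.out.println(entry);
        }
    }
}
```

Each offset returned here points at the entry's local header, which is what lets the rewrite copy entries in a contiguous order regardless of what sits between them in the original archive.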

Once you've done that, convert the ByteArrayOutputStream into a new byte[], wrap it in a ByteArrayInputStream and then pass that to ZipInputStream or ZipArchiveInputStream.

Depending on your purposes, you may not even need to do that much - you may be able to just extract each file as you go by creating a "mini" zip file with just the one entry you're reading from the directory at a time.
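
A related variant, which is not the "mini" zip approach itself, is to skip the intermediate archive and inflate one entry's bytes directly with java.util.zip.Inflater, using the sizes and offset recorded in the central directory. The sketch below assumes UTF-8 entry names, no Zip64, no encryption, and only the STORED and DEFLATED methods; DirectEntryReader and its methods are names invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class DirectEntryReader {
    /** Returns the uncompressed bytes of the named entry, or null if absent. */
    static byte[] read(byte[] zip, String wanted) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        int eocd = zip.length - 22;
        while (eocd >= 0 && buf.getInt(eocd) != 0x06054b50) {
            eocd--; // search backwards for the end of central directory record
        }
        if (eocd < 0) {
            throw new IllegalArgumentException("no end of central directory record");
        }
        int count = buf.getShort(eocd + 10) & 0xFFFF;
        int pos = buf.getInt(eocd + 16);
        for (int i = 0; i < count; i++) {
            int method = buf.getShort(pos + 10) & 0xFFFF;
            int compressedSize = buf.getInt(pos + 20);
            int size = buf.getInt(pos + 24);
            int nameLen = buf.getShort(pos + 28) & 0xFFFF;
            int extraLen = buf.getShort(pos + 30) & 0xFFFF;
            int commentLen = buf.getShort(pos + 32) & 0xFFFF;
            int local = buf.getInt(pos + 42);
            String name = new String(zip, pos + 46, nameLen, StandardCharsets.UTF_8);
            if (name.equals(wanted)) {
                // Local header: 30 fixed bytes, then its own name and extra field.
                int dataStart = local + 30
                        + (buf.getShort(local + 26) & 0xFFFF)
                        + (buf.getShort(local + 28) & 0xFFFF);
                return decode(zip, dataStart, compressedSize, size, method);
            }
            pos += 46 + nameLen + extraLen + commentLen;
        }
        return null;
    }

    private static byte[] decode(byte[] zip, int start, int compressedSize, int size, int method) {
        byte[] out = new byte[size];
        if (method == 0) { // STORED: data is not compressed
            System.arraycopy(zip, start, out, 0, size);
            return out;
        }
        if (method != 8) { // only DEFLATED is handled in this sketch
            throw new IllegalArgumentException("unsupported method " + method);
        }
        Inflater inflater = new Inflater(true); // true = raw deflate, no zlib header
        try {
            inflater.setInput(zip, start, compressedSize);
            int n = 0;
            while (n < size && !inflater.finished()) {
                int r = inflater.inflate(out, n, size - n);
                if (r == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // ran out of input before producing all the bytes
                }
                n += r;
            }
            if (n != size) {
                throw new IllegalArgumentException("truncated entry data");
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt entry data", e);
        } finally {
            inflater.end();
        }
    }

    /** Builds a small two-entry archive in memory so the sketch can be run. */
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("a.xml"));
            zos.write("<a>first</a>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.xml"));
            zos.write("<b>second</b>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Since the sizes come from the central directory rather than from the local header, this does not depend on the local size and CRC fields being filled in, and it never scans for entry signatures.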

This does involve understanding the zip file format, but not completely - just the skeleton, effectively. It's not a quick and easy fix like using an existing API completely, but it shouldn't take very long. It doesn't guarantee it'll be able to read all invalid files (how could it?) but it will protect you against the "data between entries" issue you seem to be particularly concerned about. Hope it's at least a useful idea...


there's no way to say "here's a byte array of a zip file, use it"

Yes, there is:

byte[] data = ...;
ByteArrayInputStream byteStream = new ByteArrayInputStream(data);
ZipInputStream zipStream = new ZipInputStream(byteStream);

That leaves the issue of whether ZipInputStream can handle all the zip files you'll give it - but I wouldn't write it off quite so quickly.
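
For archives that ZipInputStream can handle, a fuller version of the snippet above might look like the following sketch, which reads every entry of an in-memory archive into a map without touching the disk. The class name InMemoryUnzip and the sample archive are assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class InMemoryUnzip {
    /** Reads all entries of a zip held in memory, keyed by entry name. */
    static Map<String, byte[]> readAll(byte[] data) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zipStream = new ZipInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[8192];
            ZipEntry entry;
            while ((entry = zipStream.getNextEntry()) != null) {
                // entry.getSize() may be -1 here, so read until the entry ends
                // instead of trusting the size from the local header.
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = zipStream.read(chunk)) != -1) {
                    out.write(chunk, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    /** Builds a small two-entry archive in memory so the sketch can be run. */
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("a.xml"));
            zos.write("<a>first</a>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.xml"));
            zos.write("<b>second</b>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        readAll(sampleZip()).forEach((name, bytes) ->
                System.out.println(name + ": " + new String(bytes, StandardCharsets.UTF_8)));
    }
}
```

This still walks the archive entry by entry from the front, so it inherits the limitation discussed in the question: it does not consult the central directory.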

Of course, there are other APIs available. You may want to look at Apache Commons Compress, for example. Even though ZipFile requires a file, ZipArchiveInputStream doesn't - so again, you could use a ByteArrayInputStream. EDIT: It looks like ZipArchiveInputStream doesn't read from the central directory either. I was hoping it would use markSupported to check beforehand, but it appears not to...

EDIT: In the comments on the question, I asked where you'd read that the zip file doesn't have to contain entry data. You quoted Wikipedia:


"Tools that correctly read zip archives must scan for the signatures of the various fields, the zip central directory. They must not scan for entries because only the directory specifies where a file chunk starts. Scanning could lead to false positives, as the format doesn't forbid other data to be between chunks, or uncompressed stream containing such signatures."

That's not the same as entry data being optional. It's saying that there may be extra data in awkward places, not that the entries may be missing completely. It's basically saying that the entries shouldn't be assumed to be contiguous. I could happily concede that ZipInputStream may not be reading the central directory at the end of the file, but finding code which does that isn't the same as finding code which copes with entry data not existing.

You then wrote:


I might further add that whether the zip is valid or not is not my concern. Working with it is.

... which suggests you want code which will handle invalid zip files. Combined with this:


I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through the stream

That means you're asking for code which should handle zip files which are invalid in ways you can't even predict. Just how invalid would it have to be for you to be able to reject it? If I give you 1000 random bytes, with no attempt for them to be a zip file at all, what on earth would you do with it?

Basically, you need to pin the problem down more tightly before it's feasible to even say whether a particular library is a valid solution. It's reasonable to collect a set of zip files from various places, which may be invalid in well-understood ways, and say "I must be able to support all of these." Later you may need to do some work if it turns out that wasn't good enough. But to be able to support anything, however broken, simply isn't a valid requirement.
