Random access to gzipped files?


Problem description


I have a very large file compressed with gzip sitting on disk. The production environment is "Cloud"-based, so the storage performance is terrible, but CPU is fine. Previously, our data processing pipeline began with gzip -dc streaming the data off the disk.

Now, in order to parallelise the work, I want to run multiple pipelines that each take a pair of byte offsets - start and end - and take that chunk of the file. With a plain file this could be achieved with head and tail, but I'm not sure how to do it efficiently with a compressed file; if I gzip -dc and pipe into head, the offset pairs that are toward the end of the file will involve wastefully seeking through the whole file as it's slowly decompressed.
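To make the cost concrete, here is a minimal Python sketch of what that wasteful approach amounts to (the helper name read_uncompressed_chunk is illustrative only): GzipFile can "seek" within the uncompressed stream, but it can only reach the offset by decompressing and discarding everything before it.

    import gzip

    def read_uncompressed_chunk(path, start, end):
        """Return bytes [start, end) of the decompressed stream.

        GzipFile.seek() reaches `start` only by decompressing and
        discarding everything before it, so a chunk near the end of a
        large file pays nearly the full decompression cost anyway.
        """
        with gzip.open(path, "rb") as f:
            f.seek(start)
            return f.read(end - start)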

So my question is really about the gzip algorithm - is it theoretically possible to seek to a byte offset in the underlying file or get an arbitrary chunk of it, without the full implications of decompressing the entire file up to that point? If not, how else might I efficiently partition a file for "random" access by multiple processes while minimising the I/O throughput overhead?

Solution

You can't do that with gzip, but you can do it with bzip2, which is block-based instead of stream-based - this is how Hadoop's DFS splits huge files and parallelizes their reading across different mappers in MapReduce. Perhaps it would make sense to re-compress your files as bz2 so you can take advantage of this; it would be easier than some ad-hoc way of chunking up the files.
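To illustrate why a block-based format makes this possible, here is a rough Python sketch (a hypothetical helper, not Hadoop's actual code): bzip2 marks the start of every compressed block with the 48-bit magic 0x314159265359, so candidate split points can be located by scanning for that pattern. A real splitter would also verify each hit by attempting to decompress from it, since the pattern could in principle occur by chance inside compressed data.

    BLOCK_MAGIC = 0x314159265359  # 48-bit bzip2 block-header magic (digits of pi in BCD)

    def find_block_bit_offsets(path):
        """Return bit offsets of candidate bzip2 block starts.

        Blocks are bit-aligned, not byte-aligned, so the scan shifts a
        48-bit window one bit at a time over the data.
        """
        offsets = []
        window = 0
        bits_seen = 0
        with open(path, "rb") as f:
            data = f.read()  # a real implementation would stream instead of slurping
        for byte in data:
            for i in range(8):
                bit = (byte >> (7 - i)) & 1
                window = ((window << 1) | bit) & ((1 << 48) - 1)
                bits_seen += 1
                if bits_seen >= 48 and window == BLOCK_MAGIC:
                    offsets.append(bits_seen - 48)
        return offsets

A worker that knows a block's offset can decompress that block independently of the rest of the file, which is the property the Hadoop patch linked below builds on.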

I found the patches implementing this in Hadoop here: https://issues.apache.org/jira/browse/HADOOP-4012

Here's another post on the topic: BZip2 file read in Hadoop

Perhaps browsing the Hadoop source code would give you an idea of how to read bzip2 files by blocks.
