Random access to gzipped files?


Problem Description

I have a very large file compressed with gzip sitting on disk. The production environment is "Cloud"-based, so the storage performance is terrible, but CPU is fine. Previously our data processing pipeline began with gzip -dc streaming the data off the disk.

Now, in order to parallelise the work, I want to run multiple pipelines that each take a pair of byte offsets - start and end - and process that chunk of the file. With a plain file this could be achieved with head and tail, but I'm not sure how to do it efficiently with a compressed file; if I gzip -dc and pipe into head, the offset pairs that are toward the end of the file will involve wastefully seeking through the whole file as it's slowly decompressed.
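
For reference, a minimal sketch of the plain-file case, assuming hypothetical 0-based byte offsets START and END and a placeholder downstream command process_chunk:

    # Extract bytes [START, END) from an uncompressed file and feed one worker.
    # tail -c +N starts at the Nth byte, 1-indexed, hence the +1.
    START=1000000
    END=2000000
    tail -c +$((START + 1)) data.txt | head -c $((END - START)) | process_chunk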

So my question is really about the gzip algorithm - is it theoretically possible to seek to a byte offset in the underlying file or get an arbitrary chunk of it, without the full implications of decompressing the entire file up to that point? If not, how else might I efficiently partition a file for 'random' access by multiple processes while minimising the I/O throughput overhead?

Recommended Answer

You can't do that with gzip, but you can do it with bzip2, which is block-based instead of stream-based - this is how the Hadoop DFS splits and parallelizes the reading of huge files with different mappers in its MapReduce algorithm. Perhaps it would make sense to re-compress your files as bz2 so you can take advantage of this; it would be easier than some ad-hoc way to chunk up the files.
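
As a quick way to see that block independence in action, the stock bzip2recover tool (intended for salvaging damaged archives) splits a .bz2 file into one standalone, independently decompressible file per compression block; process_chunk below is again just a placeholder for your downstream processing:

    # Split data.bz2 into one self-contained .bz2 file per block.
    bzip2recover data.bz2          # writes rec00001data.bz2, rec00002data.bz2, ...

    # Each block can now be decompressed without touching the others,
    # e.g. handing every block to its own worker:
    for block in rec*data.bz2; do
        bzip2 -dc "$block" | process_chunk &
    done
    wait

In a real pipeline you would more likely have a block-aware reader (like Hadoop's) locate block boundaries inside the single file, rather than materialising every block on disk, but this shows why bzip2 chunks cleanly where gzip does not.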

I found the patches that implement this in Hadoop here: https://issues.apache.org/jira/browse/HADOOP-4012

Here's another post on the topic: BZip2 file read in Hadoop

Perhaps browsing the Hadoop source code would give you an idea of how to read bzip2 files by blocks.
