Random access to gzipped files?


Problem Description

I have a very large file compressed with gzip sitting on disk. The production environment is "Cloud"-based, so the storage performance is terrible, but CPU is fine. Previously our data processing pipeline began with gzip -dc streaming the data off the disk.

Now, in order to parallelise the work, I want to run multiple pipelines that each take a pair of byte offsets - start and end - and process that chunk of the file. With a plain file this could be achieved with head and tail, but I'm not sure how to do it efficiently with a compressed file; if I gzip -dc and pipe into head, the offset pairs that are toward the end of the file will involve wastefully seeking through the whole file as it's slowly decompressed.
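
For reference, a minimal sketch of the plain-file case, assuming hypothetical 0-based byte offsets START and END and a placeholder downstream command process_chunk:

    # Extract bytes [START, END) from an uncompressed file and feed one worker.
    # tail -c +N starts at the Nth byte, 1-indexed, hence the +1.
    START=1000000
    END=2000000
    tail -c +$((START + 1)) data.txt | head -c $((END - START)) | process_chunk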

So my question is really about the gzip algorithm - is it theoretically possible to seek to a byte offset in the underlying file or get an arbitrary chunk of it, without the full implications of decompressing the entire file up to that point? If not, how else might I efficiently partition a file for 'random' access by multiple processes while minimising the I/O throughput overhead?

Recommended Answer

You can't do that with gzip, but you can do it with bzip2, which is block-based instead of stream-based - this is how the Hadoop DFS splits and parallelizes the reading of huge files with different mappers in its MapReduce algorithm. Perhaps it would make sense to re-compress your files as bz2 so you can take advantage of this; it would be easier than some ad-hoc way to chunk up the files.
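
As a quick way to see that block independence in action, the stock bzip2recover tool (intended for salvaging damaged archives) splits a .bz2 file into one standalone, independently decompressible file per compression block; process_chunk below is again just a placeholder for your downstream processing:

    # Split data.bz2 into one self-contained .bz2 file per block.
    bzip2recover data.bz2          # writes rec00001data.bz2, rec00002data.bz2, ...

    # Each block can now be decompressed without touching the others,
    # e.g. handing every block to its own worker:
    for block in rec*data.bz2; do
        bzip2 -dc "$block" | process_chunk &
    done
    wait

In a real pipeline you would more likely have a block-aware reader (like Hadoop's) locate block boundaries inside the single file, rather than materialising every block on disk, but this shows why bzip2 chunks cleanly where gzip does not.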

I found the patches that implement this in Hadoop here: https://issues.apache.org/jira/browse/HADOOP-4012

Here's another post on the topic: BZip2 file read in Hadoop

Perhaps browsing the Hadoop source code would give you an idea of how to read bzip2 files by blocks.
