How does one achieve parallel gzip compression with Python?


Question

Big file compression with python gives a very nice example of how to use e.g. bz2 to compress a very large set of files (or one big file) purely in Python.
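For reference, a minimal sketch of that kind of single-threaded, pure-Python compression (the file names and chunk size here are placeholders, not from the linked question):

```python
import bz2
import shutil

# Stream a large file through bz2 in fixed-size chunks, purely in Python.
# "big_input.bin" is a placeholder input path.
with open("big_input.bin", "rb") as src, bz2.open("big_input.bin.bz2", "wb") as dst:
    shutil.copyfileobj(src, dst, length=1024 * 1024)  # copy in 1 MiB chunks
```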

pigz says you can do better by exploiting parallel compression. To my knowledge (and Google searching), I have so far been unable to find a Python equivalent that does so in pure Python code.

Is there a parallel Python implementation of pigz, or an equivalent?

Answer

I don't know of a pigz interface for Python off-hand, but it might not be that hard to write if you really need it. Python's zlib module allows compressing arbitrary chunks of bytes, and the pigz man page already describes the system for parallelizing the compression and the output format.
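As a small illustration of that building block, each chunk can be deflated independently with zlib (here as a raw deflate stream, which is what the gzip container wraps; the payload is a placeholder):

```python
import zlib

chunk = b"some data to deflate " * 1000  # placeholder payload

# Negative wbits produces a raw deflate stream with no zlib header,
# the form of stream that sits inside a gzip member.
co = zlib.compressobj(level=6, wbits=-zlib.MAX_WBITS)
deflated = co.compress(chunk) + co.flush()

# Round-trip to show the chunk compresses and decompresses on its own.
assert zlib.decompress(deflated, wbits=-zlib.MAX_WBITS) == chunk
```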

If you really need parallel compression, it should be possible to implement a pigz equivalent using zlib to compress chunks wrapped in multiprocessing.dummy.Pool.imap (multiprocessing.dummy is the thread-backed version of the multiprocessing API, so you wouldn't incur massive IPC costs sending chunks to and from the workers) to parallelize the compression; a sketch follows below. Since zlib is one of the few built-in modules that releases the GIL during CPU-bound work, you might actually gain a benefit from thread-based parallelism.
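Here is a minimal sketch of that idea. It cheats on the output format relative to pigz: each chunk is written as an independent gzip member and the members are simply concatenated, which is valid per the gzip spec and decompressible with plain gzip -d, whereas pigz itself emits a single member with a combined CRC. File names, chunk size, and thread count are placeholders:

```python
import gzip
from multiprocessing.dummy import Pool  # thread-backed Pool: cheap to pass chunks to

CHUNK_SIZE = 128 * 1024  # 128 KiB, the block size pigz uses by default

def read_chunks(path, size=CHUNK_SIZE):
    """Yield successive fixed-size chunks of the input file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(size)
            if not chunk:
                break
            yield chunk

def parallel_gzip(src, dst, threads=4):
    """Compress chunks in parallel threads; imap preserves input order,
    so the gzip members are written back out in the right sequence."""
    with Pool(threads) as pool, open(dst, "wb") as out:
        for member in pool.imap(gzip.compress, read_chunks(src)):
            out.write(member)

if __name__ == "__main__":
    parallel_gzip("big_input.bin", "big_input.bin.gz")
```

The multi-member shortcut costs a few bytes of header and trailer per chunk and a slightly worse ratio (the deflate dictionary resets at each member boundary); that is the price of not reimplementing pigz's CRC combining.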

Note that in practice, when the compression level isn't turned up that high, I/O is often of similar cost (within an order of magnitude or so) to the actual zlib compression; if your data source can't actually feed the threads faster than they compress, you won't gain much from parallelizing.
