How can I process a tarfile with a Python multiprocessing pool?


Question

I'm trying to process the contents of a tarfile using multiprocessing.Pool. I'm able to successfully use the ThreadPool implementation within the multiprocessing module, but would like to be able to use processes instead of threads as it would possibly be faster and eliminate some changes made for Matplotlib to handle the multithreaded environment. I'm getting an error that I suspect is related to processes not sharing address space, but I'm not sure how to fix it:

Traceback (most recent call last):
  File "test_tarfile.py", line 32, in <module>
    test_multiproc()
  File "test_tarfile.py", line 24, in test_multiproc
    pool.map(read_file, files)
  File "/ldata/whitcomb/epd-7.1-2-rh5-x86_64/lib/python2.7/multiprocessing/pool.py", line 225, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/ldata/whitcomb/epd-7.1-2-rh5-x86_64/lib/python2.7/multiprocessing/pool.py", line 522, in get
    raise self._value
ValueError: I/O operation on closed file

The actual program is more complicated, but this is an example of what I'm doing that reproduces the error:

from multiprocessing.pool import ThreadPool, Pool
import StringIO
import tarfile

def write_tar():
    tar = tarfile.open('test.tar', 'w')
    contents = 'line1'
    info = tarfile.TarInfo('file1.txt')
    info.size = len(contents)
    tar.addfile(info, StringIO.StringIO(contents))
    tar.close()

def test_multithread():
    tar   = tarfile.open('test.tar')
    files = [tar.extractfile(member) for member in tar.getmembers()]
    pool  = ThreadPool(processes=1)
    pool.map(read_file, files)
    tar.close()

def test_multiproc():
    tar   = tarfile.open('test.tar')
    files = [tar.extractfile(member) for member in tar.getmembers()]
    pool  = Pool(processes=1)
    pool.map(read_file, files)
    tar.close()

def read_file(f):
    print f.read()

write_tar()
test_multithread()
test_multiproc()

I suspect that something's wrong when the TarInfo object is passed into the other process but the parent TarFile is not, but I'm not sure how to fix it in the multiprocessing case. Can I do this without having to extract files from the tarball and write them to disk?

Answer

You're not passing a TarInfo object into the other process, you're passing the result of tar.extractfile(member) into the other process where member is a TarInfo object. The extractfile(...) method returns a file-like object which has, among other things, a read() method which operates upon the original tar file you opened with tar = tarfile.open('test.tar').
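
The file-like object returned by extractfile(...) reads through the parent TarFile's open handle, so it only works while that handle is alive. You can reproduce the same failure in a single process just by closing the archive before reading; this is a minimal sketch, assuming the test.tar written by write_tar() above, and on CPython 2.7 it raises the same ValueError seen in the traceback:

tar = tarfile.open('test.tar')
f   = tar.extractfile(tar.getmembers()[0])
tar.close()   # closes the underlying handle that f reads from
f.read()      # raises ValueError: I/O operation on closed file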

However, you can't use an open file from one process in another process, you have to re-open the file. I replaced your test_multiproc() with this:

def test_multiproc():
    tar   = tarfile.open('test.tar')
    files = tar.getnames()
    pool  = Pool(processes=1)
    result = pool.map(read_file2, files)
    tar.close()

and added the following:

def read_file2(name):
    t2 = tarfile.open('test.tar')
    print t2.extractfile(name).read()
    t2.close()

and was able to get your code working.
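
Reopening the archive inside read_file2 is correct but happens once per member. If the tarball has many files, a Pool initializer lets each worker process open it a single time instead. This is a sketch of that variant, not part of the original answer; init_worker and read_file3 are illustrative names, worker_tar is a per-process global, and it reuses the imports from the example above:

worker_tar = None

def init_worker(path):
    # Runs once in each worker process when the pool starts.
    global worker_tar
    worker_tar = tarfile.open(path)

def read_file3(name):
    # Reads through the handle this worker opened in init_worker.
    print worker_tar.extractfile(name).read()

def test_multiproc_initializer():
    tar   = tarfile.open('test.tar')
    names = tar.getnames()
    tar.close()
    pool  = Pool(processes=1, initializer=init_worker, initargs=('test.tar',))
    pool.map(read_file3, names)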
