Gzip issue with multiprocessing pool


Problem description

I have a gzip file handle that I'm writing to from a multiprocessing pool. Unfortunately, the output file seems to become corrupted after a certain point, so doing something like zcat out | wc gives:

gzip: out: invalid compressed data--format violated

I'm dealing with the problem by not using gzip. But I'm curious as to why this is happening and whether there is any solution.

Not sure if it matters, but I'm running the code on a remote Linux machine that I don't control; my guess is that it's an Ubuntu machine. Python 2.7.3.

Here is the slightly simplified code:

import gzip
from multiprocessing import Lock, Pool

lock = Lock()
ohandle = gzip.open("out", "w")

def process(fn):
  rv = []
  for l in open(fn):
    sometext = dosomething(l)
    rv.append(sometext)

  lock.acquire()
  for sometext in rv:
    print >> ohandle, sometext
  lock.release()

pool = Pool(processes=4)
pm = pool.map(process, some_file_list)
ohandle.close()

Answer

See:

  • You should guard the calling part with if __name__ == '__main__'. Otherwise that part will be run by the child processes.
  • Explicitly pass resources to the child processes. (ohandle, lock)

    I modified your code so that it does not use a lock and does not share ohandle. Instead, I used a temporary file per input (fn + '.temp').

    Caution: you should check the filenames first. If any file with a '.temp' suffix already exists, my code could delete your data.

    import gzip
    import os
    from multiprocessing import Pool

    def process(fn):
        out_fn = fn + '.temp'
        with open(fn) as f, open(out_fn, 'w') as f2:
            for l in f:
                sometext = dosomething(l)
                print >> f2, sometext
        return out_fn
    
    if __name__ == '__main__':
        some_file_list = ...
        pool = Pool(processes=4)
    
        ohandle = gzip.open('out.gz', 'w')
        for fn in pool.map(process, some_file_list):
            with open(fn) as f:
                while True:
                    data = f.read(1<<12)
                    if not data: break
                    ohandle.write(data)
            os.unlink(fn)
        pool.close()
        pool.join()
    
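As a side note, the parent does not have to recompress anything: concatenated gzip members form a valid gzip stream (RFC 1952), so each worker could compress its own part and the parent could join them with plain byte concatenation. A sketch, with illustrative helper names (compress_part, concat_parts) that are not from the original answer:

```python
import gzip
import os
import shutil

def compress_part(fn):
    # worker side: compress one input file into its own gzip member
    out_fn = fn + '.gz.temp'
    with open(fn, 'rb') as src, gzip.open(out_fn, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return out_fn

def concat_parts(parts, out_fn):
    # parent side: concatenated gzip members are themselves a valid
    # gzip stream (RFC 1952), so raw byte concatenation is enough
    with open(out_fn, 'wb') as dst:
        for p in parts:
            with open(p, 'rb') as src:
                shutil.copyfileobj(src, dst)
            os.unlink(p)
```

Tools like zcat, and Python's gzip module, read all members of such a stream transparently.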
