Python parallel processing to unzip files


Question

I'm new to parallel processing in Python. I have a piece of code below that walks through all directories and unzips all tar.gz files. However, it takes quite a bit of time.

import tarfile
import os

def unziptar(path):
    """Walk the tree and extract every tar.gz archive into its own directory."""
    for root, dirs, files in os.walk(path):
        for i in files:
            if i.endswith("tar.gz"):
                fullpath = os.path.join(root, i)
                print('extracting... {}'.format(fullpath))
                with tarfile.open(fullpath, 'r:gz') as tar:
                    tar.extractall(root)

path = 'C://path_to_folder'
unziptar(path)

print('tar.gz extraction completed')

I have been looking through some posts about the multiprocessing and joblib packages, but I'm still not very clear on how to modify my script to run in parallel. Any help is appreciated.
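
(For reference, the joblib pattern from those posts maps onto this task roughly as in the sketch below. It is a minimal illustration only: the helper names unziptar_one and find_archives are invented here, not taken from the thread, and it assumes joblib is installed.)

import os
import tarfile
from joblib import Parallel, delayed

def unziptar_one(fullpath):
    """Hypothetical worker: extract a single tar.gz archive next to itself."""
    with tarfile.open(fullpath, 'r:gz') as tar:
        tar.extractall(os.path.dirname(fullpath))

def find_archives(path):
    """Yield the full path of every tar.gz file under path."""
    for root, dirs, files in os.walk(path):
        for name in files:
            if name.endswith("tar.gz"):
                yield os.path.join(root, name)

# n_jobs=-1 starts one worker per CPU core.
Parallel(n_jobs=-1)(delayed(unziptar_one)(p) for p in find_archives('C://path_to_folder'))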

EDIT: @tdelaney

Thanks for the help. Surprisingly, the modified script took twice as long to unzip everything (60 minutes compared to 30 minutes with the original script)!

I looked at the task manager, and it appears that while multiple cores were utilised, the CPU usage was very low. I'm not sure why this is so.

Answer

It's pretty easy to create a pool to do the work. Just pull the extractor out into a separate worker.

import tarfile
import os
import multiprocessing as mp

def unziptar(fullpath):
    """Worker: unzip one archive into the directory that contains it."""
    print('extracting... {}'.format(fullpath))
    with tarfile.open(fullpath, 'r:gz') as tar:
        tar.extractall(os.path.dirname(fullpath))

def fanout_unziptar(path):
    """Create a pool to extract all tar.gz files under path."""
    my_files = []
    for root, dirs, files in os.walk(path):
        for i in files:
            if i.endswith("tar.gz"):
                my_files.append(os.path.join(root, i))

    # No more workers than cores, and no more workers than archives.
    pool = mp.Pool(min(mp.cpu_count(), len(my_files)))
    # chunksize=1 hands out one archive at a time, so a few huge files
    # don't pile up on a single worker.
    pool.map(unziptar, my_files, chunksize=1)
    pool.close()
    pool.join()


if __name__ == "__main__":
    path = 'C://path_to_folder'
    fanout_unziptar(path)
    print('tar.gz extraction has completed')
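
A follow-up on the slowdown reported in the edit: unpacking tar.gz files is often limited by disk throughput rather than CPU, so several processes can end up contending for the same disk, and on Windows each pool worker also pays a fresh-interpreter startup cost. One cheap thing to try is threads instead of processes, since file I/O and zlib decompression release the GIL while they run. The variant below is a hypothetical sketch, not part of the original answer:

import os
import tarfile
from multiprocessing.pool import ThreadPool

def unziptar(fullpath):
    """Same worker as above: unzip one archive next to itself."""
    with tarfile.open(fullpath, 'r:gz') as tar:
        tar.extractall(os.path.dirname(fullpath))

def fanout_unziptar_threads(path, workers=4):
    """Thread-based fan-out over all tar.gz files under path."""
    my_files = [os.path.join(root, name)
                for root, dirs, files in os.walk(path)
                for name in files if name.endswith("tar.gz")]
    if not my_files:
        return
    pool = ThreadPool(min(workers, len(my_files)))
    pool.map(unziptar, my_files, chunksize=1)
    pool.close()
    pool.join()

If the thread version is no faster either, the bottleneck is almost certainly the disk itself, and the sequential original is already close to what the hardware allows.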
