Python - How to consume and operate on files in a directory in parallel

Question

Current scenario: I have 900 files in a directory called directoryA. The files are named file0.txt through file899.txt, and each is 15MB in size. I loop through the files sequentially in Python. I load each file as a list, do some operations, and write an output file to directoryB. When the loop ends I have 900 files in directoryB, named out0.csv through out899.csv.

Problem: Processing each file takes 3 minutes, so the script runs for more than 40 hours. I would like to run the process in parallel, since all the files are independent of each other (there are no inter-dependencies). My machine has 12 cores.

The script below runs sequentially. Please help me run it in parallel. I have looked at some of the parallel processing modules in Python via related Stack Overflow questions, but they are difficult for me to understand as I don't have much exposure to Python. Thanks a billion.

Pseudo-script

    from os import listdir
    import csv

    mypath = "some/path/"

    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    for filename in listdir(inputDir):
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
        # write out a corresponding csv file in dirB, e.g. input file99.txt is written as out99.csv
        pass

Recommended answer

To make full use of your CPU cores, it's better to use the multiprocessing library.

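    from multiprocessing import Pool
    from os import listdir
    from os.path import join
    import csv

    mypath = "some/path/"

    # defined at module level so that the worker processes can see them too
    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    def process_file(filename):
        # listdir() returns bare names, so build the full input path first
        inputPath = join(inputDir, filename)
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
        # write out a corresponding csv file in dirB, e.g. input file99.txt is written as out99.csv
        pass

    if __name__ == '__main__':
        # one worker process per core; map() blocks until every file is done
        p = Pool(12)
        p.map(process_file, listdir(inputDir))
        p.close()
        p.join()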
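
The commented steps inside process_file can be filled in along the following lines. This is only a sketch under stated assumptions, not the asker's actual code. The helper names output_name_for and run_parallel are hypothetical, the input files are assumed to be comma-separated, and the "bunch of operations" is left as a pass-through because the question does not describe it. The directories are passed in as arguments with functools.partial instead of being read from globals.

```python
import csv
import os
import re
from functools import partial
from multiprocessing import Pool


def output_name_for(filename):
    """Map 'file42.txt' to 'out42.csv'; return None for names that do not match."""
    match = re.fullmatch(r"file(\d+)\.txt", filename)
    if match is None:
        return None
    return "out{}.csv".format(int(match.group(1)))


def process_file(filename, input_dir, output_dir):
    """Read one input file, transform its rows, and write the matching CSV."""
    out_name = output_name_for(filename)
    if out_name is None:
        return None  # skip anything that is not named fileN.txt

    with open(os.path.join(input_dir, filename), newline="") as src:
        rows = list(csv.reader(src))

    # Placeholder: the real operations are not described in the question.
    results = rows

    out_path = os.path.join(output_dir, out_name)
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerows(results)
    return out_path


def run_parallel(input_dir, output_dir, workers=12):
    """Process every file in input_dir using a pool of worker processes."""
    os.makedirs(output_dir, exist_ok=True)
    worker = partial(process_file, input_dir=input_dir, output_dir=output_dir)
    with Pool(workers) as pool:
        return pool.map(worker, sorted(os.listdir(input_dir)))
```

Call run_parallel("some/path/dirA/", "some/path/dirB/") from under an if __name__ == '__main__': guard, as in the answer above. The with Pool(...) form requires Python 3.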

multiprocessing documentation: https://docs.python.org/2/library/multiprocessing.html
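
For a job that still runs for hours, progress feedback can be useful. One option, not part of the original answer, is Pool.imap_unordered, which yields each result as soon as a worker finishes instead of blocking until all of them are done. A minimal sketch, where consume_with_progress and run_with_progress are hypothetical helper names and the reporting interval is an arbitrary choice:

```python
from multiprocessing import Pool


def consume_with_progress(results, total, report_every=50):
    """Drain an iterator of results, printing a line every `report_every` items."""
    done = 0
    for done, _ in enumerate(results, start=1):
        if done % report_every == 0 or done == total:
            print("{}/{} files done".format(done, total))
    return done


def run_with_progress(worker, items, workers=12):
    """Run `worker` over `items` in a pool, reporting progress as results arrive."""
    items = list(items)
    with Pool(workers) as pool:
        # imap_unordered yields results in completion order, not input order.
        return consume_with_progress(pool.imap_unordered(worker, items), len(items))
```

Here worker must be a function defined at module level so that it can be pickled, and run_with_progress should be called from under the if __name__ == '__main__': guard.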
