Python - How to parallel consume and operate on files in a directory

Question

Current scenario: I have 900 files in a directory called directoryA. The files are named file0.txt through file899.txt, and each is 15 MB in size. I loop through the files sequentially in Python. I load each file as a list, do some operations, and write an output file to directoryB. When the loop ends I have 900 files in directoryB, named out0.csv through out899.csv.

Problem: Processing each file takes 3 minutes, so the script runs for more than 40 hours. I would like to run the process in parallel, since the files are all independent of each other (there are no inter-dependencies). My machine has 12 cores.

The script below runs sequentially. Please help me run it in parallel. I have looked at some of the parallel processing modules in Python through related Stack Overflow questions, but they are difficult for me to understand as I don't have much exposure to Python. Thanks a billion.

Pseudo script

    from os import listdir 
    import csv

    mypath = "some/path/"

    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    for files in listdir(inputDir):
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
        # write out a corresponding csv file in dirB, e.g. input file file99.txt is written as out99.csv
        pass  # placeholder so the loop body is valid Python

Recommended answer

To make full use of your hardware cores, it's better to use the multiprocessing library.

    from multiprocessing import Pool
    from os import listdir
    import csv

    mypath = "some/path/"

    # defined at module level so that the worker processes can see them too
    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    def process_file(file):
        # file is a bare name from listdir(); open it as inputDir + file
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
        # write out a corresponding csv file in dirB, e.g. input file file99.txt is written as out99.csv
        pass  # placeholder so the skeleton is valid Python

    if __name__ == '__main__':
        p = Pool(12)  # one worker process per core
        p.map(process_file, listdir(inputDir))
        p.close()
        p.join()
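
One way the body of process_file could be filled in, as a minimal sketch rather than the asker's actual processing. It assumes Python 3 and comma-delimited input files; the "bunch of operations" step is left as a pass-through, and the helper names output_name and run_parallel are made up for illustration. The directories are bound with functools.partial so the workers do not rely on module-level globals.

```python
import csv
import os
import re
from functools import partial
from multiprocessing import Pool


def output_name(input_name):
    """Map 'file42.txt' to 'out42.csv' by pulling the integer out of the name."""
    match = re.search(r'(\d+)', input_name)
    if match is None:
        raise ValueError('no number in file name: %r' % input_name)
    return 'out%d.csv' % int(match.group(1))


def process_file(input_name, input_dir, output_dir):
    """Read one input file, transform its rows, and write the matching csv."""
    # Assumption: the input is comma-delimited text; change the delimiter if not.
    with open(os.path.join(input_dir, input_name), newline='') as src:
        rows = list(csv.reader(src))

    # Placeholder for "a bunch of operations"; rows pass through unchanged here.
    result = rows

    out_path = os.path.join(output_dir, output_name(input_name))
    with open(out_path, 'w', newline='') as dst:
        csv.writer(dst).writerows(result)
    return out_path


def run_parallel(input_dir, output_dir, workers=12):
    """Process every file in input_dir with a pool of worker processes."""
    # partial() fixes the two directories so Pool.map only has to pass the name.
    worker = partial(process_file, input_dir=input_dir, output_dir=output_dir)
    with Pool(workers) as pool:
        return pool.map(worker, sorted(os.listdir(input_dir)))
```

Calling run_parallel('some/path/dirA/', 'some/path/dirB/') from under an `if __name__ == '__main__':` guard would then process the whole directory and return the list of output paths. The guard matters on platforms where worker processes re-import the main module.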

Multiprocessing documentation: https://docs.python.org/2/library/multiprocessing.html
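
For a rough sense of what the pool could buy, the figures from the question (900 files, 3 minutes each, 12 cores) can be plugged into a back-of-the-envelope estimate. This is only a sketch under the assumption of ideal scaling: the work is CPU-bound, every worker stays busy, and disk I/O does not become the bottleneck. The function name estimated_hours is made up for illustration.

```python
import math


def estimated_hours(n_files, minutes_per_file, workers):
    """Wall-clock estimate assuming ideal scaling across the workers."""
    # Files are handled in "rounds" of up to `workers` files at a time.
    rounds = math.ceil(n_files / workers)
    return rounds * minutes_per_file / 60


sequential = estimated_hours(900, 3, 1)   # 900 rounds of 3 minutes
parallel = estimated_hours(900, 3, 12)    # 75 rounds of 3 minutes
print(sequential, parallel)  # prints: 45.0 3.75
```

Real runs will usually land somewhere above the ideal figure, since 12 processes reading and writing 15 MB files at once compete for the same disk and memory.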
