Python - How to consume and operate on files in a directory in parallel
Question
Current scenario: I have 900 files in a directory called directoryA. The files are named file0.txt through file899.txt, and each is 15MB in size. I loop through the files sequentially in Python. I load each file as a list, do some operations, and write an output file to directoryB. When the loop ends I have 900 files in directoryB, named out0.csv through out899.csv.
Problem: Processing each file takes 3 minutes, so the script runs for more than 40 hours. I would like to run the process in parallel, since the files are all independent of each other (there are no inter-dependencies). My machine has 12 cores.
The script below runs sequentially. Please help me run it in parallel. I have looked at some of Python's parallel processing modules through related Stack Overflow questions, but they are difficult for me to understand as I don't have much exposure to Python. Thanks a billion.
Pseudo script
from os import listdir
import csv

mypath = "some/path/"
inputDir = mypath + 'dirA/'
outputDir = mypath + 'dirB/'

for files in listdir(inputDir):
    # load the text file as list using csv module
    # run a bunch of operations
    # regex the int from the filename. for ex file1.txt returns 1, and file42.txt returns 42
    # write out a corresponding csv file in dirB. For example input file file99.txt is written as out99.csv
Recommended answer
To make full use of your CPU cores, use the multiprocessing library.
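from multiprocessing import Pool
from os import listdir
import csv

# Defined at module level so that worker processes can see them too
mypath = "some/path/"
inputDir = mypath + 'dirA/'
outputDir = mypath + 'dirB/'

def process_file(file):
    # file is a bare filename from listdir(); prepend inputDir to open it
    # load the text file as list using csv module
    # run a bunch of operations
    # regex the int from the filename. for ex file1.txt returns 1, and file42.txt returns 42
    # write out a corresponding csv file in dirB. For example input file file99.txt is written as out99.csv
    pass  # placeholder so the function body is valid Python

if __name__ == '__main__':
    p = Pool(12)  # one worker process per core
    p.map(process_file, listdir(inputDir))  # blocks until every file is processed
    p.close()
    p.join()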
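The commented steps in process_file could be filled in roughly as follows. This is only a sketch under several assumptions that the question does not confirm: Python 3, comma-delimited input files, and hypothetical helper names (output_name, transform, run_parallel) introduced purely for illustration. The transform function is a stand-in for the asker's "bunch of operations" and simply returns the rows unchanged.

```python
import csv
import os
import re
from functools import partial
from multiprocessing import Pool


def output_name(filename):
    """Map e.g. 'file42.txt' to 'out42.csv' using the first integer in the name."""
    match = re.search(r"\d+", filename)
    if match is None:
        raise ValueError("no number in filename: %r" % filename)
    return "out%d.csv" % int(match.group())


def transform(rows):
    """Hypothetical placeholder for the real operations; returns rows unchanged."""
    return rows


def process_file(filename, input_dir, output_dir):
    """Load one input file, transform it, and write the matching output file."""
    # Assumption: the input is comma-delimited text readable by csv.reader.
    with open(os.path.join(input_dir, filename), newline="") as src:
        rows = list(csv.reader(src))
    result = transform(rows)
    out_path = os.path.join(output_dir, output_name(filename))
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerows(result)
    return out_path


def run_parallel(input_dir, output_dir, workers=12):
    """Process every .txt file in input_dir using a pool of worker processes."""
    names = [n for n in os.listdir(input_dir) if n.endswith(".txt")]
    # partial() fixes the directory arguments so map() only passes the filename.
    worker = partial(process_file, input_dir=input_dir, output_dir=output_dir)
    with Pool(workers) as pool:
        return pool.map(worker, names)


if __name__ == "__main__":
    # Placeholder paths from the question; nothing runs if dirA does not exist.
    input_dir = "some/path/dirA/"
    output_dir = "some/path/dirB/"
    if os.path.isdir(input_dir):
        run_parallel(input_dir, output_dir, workers=12)
```

Passing the directories as arguments through functools.partial, rather than relying on globals, keeps the worker function usable on platforms where worker processes are started fresh instead of forked. With 12 workers and independent files, the ideal case would bring 900 files at 3 minutes each down from about 45 hours to roughly 4 hours, though disk I/O may keep the real speedup lower.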
multiprocessing documentation: https://docs.python.org/2/library/multiprocessing.html