Python - How to consume and operate on files in a directory in parallel
Question
Current scenario: I have 900 files in a directory called directoryA. The files are named file0.txt through file899.txt, and each is 15MB in size. I loop through the files sequentially in Python. I load each file as a list, do some operations, and write an output file to directoryB. When the loop ends, I have 900 files in directoryB, named out0.csv through out899.csv.
Problem: Processing each file takes 3 minutes, so the script runs for more than 40 hours. I would like to run the process in parallel, since the files are all independent of each other (there are no inter-dependencies). My machine has 12 cores.
The script below runs sequentially. Please help me run it in parallel. I have looked at some of the parallel processing modules in Python through related Stack Overflow questions, but they are difficult for me to understand as I don't have much exposure to Python. Thanks a billion.
Pseudo script
from os import listdir
import csv

mypath = "some/path/"
inputDir = mypath + 'dirA/'
outputDir = mypath + 'dirB/'

for files in listdir(inputDir):
    # load the text file as a list using the csv module
    # run a bunch of operations
    # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
    # write out a corresponding csv file in dirB, e.g. input file file99.txt is written as out99.csv
Recommended answer
To fully utilize your hardware cores, it's better to use the multiprocessing library.
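import csv
import re
from multiprocessing import Pool
from os import listdir

# Defined at module level so they also exist in worker processes
# that re-import this file (for example on Windows)
mypath = "some/path/"
inputDir = mypath + 'dirA/'
outputDir = mypath + 'dirB/'

def process_file(file):
    # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
    number = int(re.search(r'\d+', file).group())
    # load the text file inputDir + file as a list using the csv module
    # run a bunch of operations
    # write out a corresponding csv file in dirB, e.g. input file file99.txt is written as out99.csv

if __name__ == '__main__':
    p = Pool(12)  # one worker process per core
    p.map(process_file, listdir(inputDir))  # listdir returns bare file names, not full paths
    p.close()
    p.join()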
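
For reference, here is one way the placeholder comments in the skeleton above could be filled in. It is only a sketch under several assumptions: Python 3, comma-delimited input files, and a pass-through in place of the real "bunch of operations", which the question does not describe. The names output_name, process_one and convert_all are made up for this example.

```python
import csv
import os
import re
from functools import partial
from multiprocessing import Pool


def output_name(input_name):
    """Map e.g. 'file42.txt' to 'out42.csv' (hypothetical helper)."""
    match = re.search(r"(\d+)\.txt$", input_name)
    if match is None:
        raise ValueError("no number in file name: %r" % input_name)
    return "out%d.csv" % int(match.group(1))


def process_one(input_name, input_dir, output_dir):
    """Read one input file, transform its rows, write the matching CSV."""
    # Assumption: the input is comma-delimited text readable by csv.reader
    with open(os.path.join(input_dir, input_name), newline="") as src:
        rows = list(csv.reader(src))

    # Placeholder for the real operations: rows are passed through unchanged
    result = rows

    out_path = os.path.join(output_dir, output_name(input_name))
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerows(result)
    return out_path


def convert_all(input_dir, output_dir, workers=12):
    """Process every .txt file in input_dir with a pool of worker processes."""
    names = sorted(n for n in os.listdir(input_dir) if n.endswith(".txt"))
    os.makedirs(output_dir, exist_ok=True)
    # partial binds the directories so each task only needs the file name
    worker = partial(process_one, input_dir=input_dir, output_dir=output_dir)
    with Pool(workers) as pool:
        # imap_unordered yields each result as soon as its file is finished
        return sorted(pool.imap_unordered(worker, names))
```

A call such as convert_all('some/path/dirA', 'some/path/dirB') should be placed under the `if __name__ == '__main__':` guard, as in the answer. Passing the directories as arguments instead of reading globals keeps the worker function independent of how the child processes are started.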
Multiprocessing documentation: https://docs.python.org/2/library/multiprocessing.html
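
As a rough idea of what to expect from 12 workers, the numbers in the question can be plugged into a back-of-the-envelope estimate. This assumes the work is CPU-bound, every file takes the same 3 minutes, and the disk keeps up with 12 simultaneous readers and writers, so treat the result as a best case. The function name estimated_hours is made up for this example.

```python
import math


def estimated_hours(n_files, minutes_per_file, workers):
    """Ideal wall-clock hours when files are handled in rounds of `workers`."""
    rounds = math.ceil(n_files / workers)
    return rounds * minutes_per_file / 60


print(estimated_hours(900, 3, 1))   # sequential run: 45.0
print(estimated_hours(900, 3, 12))  # 12 worker processes: 3.75
```

So under these assumptions the run drops from about 45 hours to just under 4. Real runs will likely be somewhat slower because of process start-up and disk contention.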