parallel directory walk python


Problem description

I need to read every file in the directory tree starting from a given root location. I would like to do this as fast as possible using parallelism. I have 48 cores at my disposal and 1 TB of RAM, so thread resources are not an issue. I also need to log every file that was read.


I looked at using joblib but am unable to combine joblib with os.walk.

I can think of two ways to do it:

  • Walk the tree and add all files to a queue or list, then have a pool of worker threads dequeue the files - best load balancing, but the initial walk and the queue overhead may cost extra time
  • Spawn threads and statically assign a portion of the tree to each thread - poor load balancing, but no initial walk; directories could be assigned based on some hash

Or is there a better way?
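
A minimal sketch of the first option, assuming a plain queue.Queue fed by the walk in the main thread and drained by worker threads (process_file and NUM_WORKERS are placeholders, not part of any existing code):

import os
import queue
import threading

NUM_WORKERS = 48
SENTINEL = None  # tells a worker there is nothing left to do

def process_file(path):
    pass  # placeholder for the real per-file work

def worker(q):
    while True:
        path = q.get()
        if path is SENTINEL:
            break
        process_file(path)

def main():
    q = queue.Queue()
    threads = [threading.Thread(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()

    # the main thread does the walk and feeds the queue
    for root, dirs, files in os.walk("some/path"):
        for name in files:
            q.put(os.path.join(root, name))

    for _ in threads:
        q.put(SENTINEL)  # one sentinel per worker shuts them all down
    for t in threads:
        t.join()

if __name__ == "__main__":
    main()

The walk itself stays serial here; the workers only overlap the per-file work.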

EDIT: storage performance is not a concern. Assume infinitely fast storage that can handle an unlimited number of parallel reads.

EDIT: removed the multi-node situation to keep the focus on the parallel directory walk.

Answer

The simplest approach is probably to use a multiprocessing.Pool to process the output of an os.walk performed in the main process.


This assumes that the main work you want to parallelize is whatever processing takes place on the individual files, not the effort of recursively scanning the directory structure. This may not be true if your files are small and you don't need to do much processing on their contents. I'm also assuming that the process creation handled for you by multiprocessing will be able to properly distribute the load over your cluster (which may or may not be true).

import itertools
import multiprocessing
import os

def worker(filename):
    pass   # do something here!

def main():
    with multiprocessing.Pool(48) as pool:  # pool of 48 processes

        # lazily turn the (root, dirs, files) tuples from os.walk into full file paths
        walk = os.walk("some/path")
        fn_gen = itertools.chain.from_iterable((os.path.join(root, file)
                                                for file in files)
                                               for root, dirs, files in walk)

        results_of_work = pool.map(worker, fn_gen)  # this does the parallel processing

if __name__ == "__main__":
    main()
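
To cover the "log every file that was read" requirement from the question, one option is to have the worker return the path it handled and do the logging in the parent process, so only one process writes to the log file. A sketch under that assumption (the log file name files_read.log and the byte count are illustrative details):

import logging

logging.basicConfig(filename="files_read.log", level=logging.INFO)

def worker(filename):
    # read the whole file and hand the path and size back to the parent
    with open(filename, "rb") as f:
        data = f.read()
    return filename, len(data)

def log_results(results):
    # runs in the main process, so only one writer touches the log file
    for filename, size in results:
        logging.info("read %s (%d bytes)", filename, size)

log_results(results_of_work) would then be called right after pool.map in main().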

It is entirely possible that parallelizing the work this way will be slower than just doing the work in a single process. This is because IO on the hard disks underlying your shared filesystem may be the bottleneck, and attempting many disk reads in parallel could make them all slower if the disks need to seek more often rather than reading longer linear sections of data. Even if the IO is a little faster, the overhead of communicating between the processes could eat up all of the gains.
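
If the per-file work turns out to be cheap, one way to reduce that communication overhead is to hand the workers filenames in batches via the chunksize argument, or to use imap_unordered so results stream back as they complete. A sketch; the batch size of 256 is just an illustration:

# batch 256 filenames per task to cut down on inter-process queue traffic
results_of_work = pool.map(worker, fn_gen, chunksize=256)

# or stream results back in completion order instead of collecting them all at once
for result in pool.imap_unordered(worker, fn_gen, chunksize=256):
    pass  # handle each result as it arrives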
