Running threads inside processes

Question

I'm running image processing on a huge dataset with multiprocessing, and I'm wondering whether running a ThreadPoolExecutor inside a Pool provides any benefit over simply running Pool on all items.

The dataset contains multiple folders, each containing images, so my initial thought was to give each folder its own process and each image in that folder its own thread. The other way would be to take every image and run it as a process.

For instance, each folder as a process and each image as a thread:

from concurrent import futures
from multiprocessing import Pool
from pathlib import Path


def handle_image(image_path: Path):
    pass


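def handle_folder(folder_path: Path):
    # Leaving the with-block waits for all submitted tasks,
    # so an explicit shutdown() call is not needed.
    with futures.ThreadPoolExecutor() as e:
        # Consume the iterator so exceptions raised in handle_image surface.
        list(e.map(handle_image, folder_path.glob("*")))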


if __name__ == '__main__':
    dataset_folder = Path("Folder")
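    with Pool() as p:
        # One task per folder; the results are not consumed here.
        p.imap_unordered(handle_folder, dataset_folder.iterdir())
        # Leaving the with-block calls terminate(), so wait for the work first.
        p.close()
        p.join()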

Versus each image as a process:

from multiprocessing import Pool
from pathlib import Path


def handle_image(image_path: Path):
    if not image_path.is_file():
        return


if __name__ == '__main__':
    dataset_folder = Path("Folder")
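    with Pool() as p:
        # chunksize=100 hands images to the workers in batches of 100.
        p.imap_unordered(handle_image, dataset_folder.glob("**/*"), chunksize=100)
        # Leaving the with-block calls terminate(), so wait for the work first.
        p.close()
        p.join()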

Recommended answer

Your task (image processing) sounds CPU-bound, so threads won't have enough idle time to let each other execute unless you are delegating to some C library that releases the GIL for most of the processing.
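One way to check this on your own machine is a quick timing comparison. The following is a minimal sketch, not a benchmark of real image processing: `cpu_bound`, `run_serial`, `run_threaded` and the job sizes are all made up for illustration, and the pure-Python loop stands in for processing that never releases the GIL. On a standard (GIL-enabled) CPython build, the threaded run would be expected to take roughly as long as the serial one.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound(n: int) -> int:
    # Hypothetical stand-in for image processing written in pure Python:
    # the loop holds the GIL for as long as it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_serial(jobs):
    # Baseline: process every job one after another in the calling thread.
    return [cpu_bound(n) for n in jobs]


def run_threaded(jobs, workers=4):
    # Same jobs on a thread pool; the threads take turns holding the GIL.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(cpu_bound, jobs))


def timed(label, fn):
    # Print the wall-clock time of fn() and return its result.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result


if __name__ == "__main__":
    jobs = [200_000] * 8  # assumed workload size, adjust as needed
    serial = timed("serial", lambda: run_serial(jobs))
    threaded = timed("threads", lambda: run_threaded(jobs))
    assert serial == threaded
```

If the real `handle_image` spends most of its time inside a C library that releases the GIL, swapping it in for `cpu_bound` should show whether threads help in your case.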

If, however, processing time is comparable to I/O time, you may get a speedup from up to a few threads per process (cf. "400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task" for how the timings compare for a much more I/O-bound task).
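A minimal sketch of that hybrid layout, assuming each image is read from disk before it is processed: one process per folder, with a small, explicitly capped thread pool inside each process. `THREADS_PER_PROCESS`, `process_dataset` and the byte-counting placeholder are assumptions made for illustration; the right thread count has to be found by measuring.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool
from pathlib import Path

THREADS_PER_PROCESS = 4  # assumption: "a few" threads, tune by measuring


def handle_image(image_path: Path) -> int:
    # Reading the file is I/O, so other threads can run while this one waits.
    data = image_path.read_bytes()
    # Placeholder for the real (CPU-bound) processing step.
    return len(data)


def handle_folder(folder_path: Path) -> int:
    # Runs inside one worker process; threads overlap the file reads.
    images = [p for p in folder_path.iterdir() if p.is_file()]
    with ThreadPoolExecutor(max_workers=THREADS_PER_PROCESS) as executor:
        return sum(executor.map(handle_image, images))


def process_dataset(dataset_folder: Path) -> int:
    # One task per folder, spread over as many processes as there are CPUs.
    folders = [p for p in dataset_folder.iterdir() if p.is_dir()]
    with Pool() as pool:
        # Iterating over the results waits for every folder to finish.
        return sum(pool.imap_unordered(handle_folder, folders))
```

Call `process_dataset(Path("Folder"))` from under an `if __name__ == '__main__':` guard, as in the question's code, so that worker processes can import the module safely.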

As a side note, for large-scale distributed work, you may take a look at one of the 3rd-party implementations of a distributed task queue for Python instead of the built-in pools and map.
