Running threads inside processes
Question
I'm running image processing on a huge dataset with multiprocessing, and I'm wondering whether running a ThreadPoolExecutor inside a Pool provides any benefit over simply running Pool on all items.
The dataset contains multiple folders, each containing images, so my initial thought was to give each folder its own process and each image in that folder its own thread. The other way would be to take every image and run it as a task in a process pool.
For instance, each folder as a process and each image as a thread:
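from concurrent import futures
from multiprocessing import Pool
from pathlib import Path


def handle_image(image_path: Path):
    pass  # per-image processing goes here


def handle_folder(folder_path: Path):
    # A thread pool shared by the images of this folder
    with futures.ThreadPoolExecutor() as e:
        # list() drains the results so exceptions raised in workers surface here
        list(e.map(handle_image, folder_path.glob("*")))
    # Leaving the with block already calls shutdown()


if __name__ == '__main__':
    dataset_folder = Path("Folder")
    with Pool() as p:
        # imap_unordered is lazy; iterating waits for every folder to finish
        for _ in p.imap_unordered(handle_folder, dataset_folder.iterdir()):
            pass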
Versus each image as a task in a process pool:
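from multiprocessing import Pool
from pathlib import Path


def handle_image(image_path: Path):
    if not image_path.is_file():
        return  # skip directories matched by the recursive glob
    # per-image processing goes here


if __name__ == '__main__':
    dataset_folder = Path("Folder")
    with Pool() as p:
        # The third argument is chunksize: paths go to workers in batches of 100
        for _ in p.imap_unordered(handle_image, dataset_folder.glob("**/*"), 100):
            pass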
Recommended answer
Your task (image processing) sounds CPU-bound, so threads won't have enough idle time to let each other run, unless you are delegating to some C library that releases the GIL for most of the processing.
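One way to check this for your own workload is to time the same CPU-bound function under a thread pool and a process pool. The following is only a rough sketch: `cpu_bound` is a hypothetical pure-Python stand-in for the real processing, and `timed_map` is a helper invented for this illustration. On a standard CPython build with the GIL, the thread version would typically not be expected to scale with the number of workers, but the actual numbers depend on your machine and interpreter.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n: int) -> int:
    """Hypothetical stand-in for pure-Python processing; holds the GIL while it runs."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_map(executor_cls, work, workers: int = 4):
    """Map cpu_bound over `work` with the given executor; return (seconds, results)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        results = list(ex.map(cpu_bound, work))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    work = [2_000_000] * 8  # arbitrary sizes, chosen only for illustration
    t_threads, r_threads = timed_map(ThreadPoolExecutor, work)
    t_procs, r_procs = timed_map(ProcessPoolExecutor, work)
    assert r_threads == r_procs  # both executors compute the same values
    print(f"threads:   {t_threads:.2f}s")
    print(f"processes: {t_procs:.2f}s")
```

If your real `handle_image` spends most of its time inside a C extension that releases the GIL, swap it in for `cpu_bound` and the comparison may look quite different.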
If, however, processing time is comparable to I/O time, you may get a speedup from up to a few threads per process (cf. "400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task" for how the timings compare on a much more I/O-bound task).
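To get a feel for whether processing time really is comparable to I/O time, you could time the two phases separately on a sample of files. This is a minimal sketch under assumptions: `process` is a hypothetical placeholder for the real image work, `io_vs_cpu` is a helper made up for this example, and the `Folder` path is taken from the question. Keep in mind that the OS file cache can make repeated reads look much faster than a cold run.

```python
import time
from pathlib import Path


def process(data: bytes) -> int:
    """Hypothetical placeholder for the real image processing step."""
    return sum(data) % 256


def io_vs_cpu(paths):
    """Return (seconds spent reading, seconds spent processing) over `paths`."""
    io_time = 0.0
    cpu_time = 0.0
    for path in paths:
        t0 = time.perf_counter()
        data = Path(path).read_bytes()  # I/O phase
        t1 = time.perf_counter()
        process(data)                   # processing phase
        t2 = time.perf_counter()
        io_time += t1 - t0
        cpu_time += t2 - t1
    return io_time, cpu_time


if __name__ == "__main__":
    # Sample at most 50 files from the dataset folder used in the question
    sample = [p for p in Path("Folder").glob("**/*") if p.is_file()][:50]
    io_time, cpu_time = io_vs_cpu(sample)
    print(f"I/O: {io_time:.3f}s  processing: {cpu_time:.3f}s")
```

If the I/O share turns out to be negligible, a plain Pool over all images is probably the simpler choice; if it is substantial, a small thread pool per process may be worth trying.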
As a side note, for large-scale distributed work, you may want to look at one of the third-party implementations of a distributed task queue for Python instead of the built-in pools and map.