Python multiple subprocess with a pool/queue: recover output as soon as one finishes and launch next job in queue

Problem description

I'm currently launching a subprocess and parsing its stdout on the fly, without waiting for the process to finish.

for sample in all_samples:
    my_tool_subprocess = subprocess.Popen('mytool {}'.format(sample),
                                          shell=True, stdout=subprocess.PIPE)
    line = my_tool_subprocess.stdout.readline()
    while line:
        # here I parse stdout..
        line = my_tool_subprocess.stdout.readline()  # returns b'' at EOF, ending the loop

In my script I perform this action multiple times; how many times depends on the number of input samples.

The main problem here is that every subprocess is a program/tool that keeps one CPU at 100% while it's running, and each run takes a while: maybe 20-40 minutes per input.

What I would like to achieve is to set up a pool or queue (I'm not sure of the exact terminology here) of at most N subprocess jobs running at the same time, so I can maximize performance instead of proceeding sequentially.

So the execution flow for, say, a pool with a maximum of 4 jobs should be:

  • Launch 4 subprocesses.
  • As soon as one of the jobs finishes, parse its stdout and launch the next one.
  • Continue until all jobs in the queue are done.

Even if I achieve this, I really don't know how I could identify which sample's subprocess is the one that has finished. At the moment I don't need to identify them, since each subprocess runs sequentially and I parse its stdout as the subprocess prints it.

This is really important, since I need to identify the output of each subprocess and assign it to its corresponding input/sample.

Recommended answer

ThreadPool could be a good fit for your problem: you set the number of worker threads and add jobs, and the threads will work their way through all the tasks.

from multiprocessing.pool import ThreadPool
import subprocess


def work(sample):
    my_tool_subprocess = subprocess.Popen('mytool {}'.format(sample),
                                          shell=True, stdout=subprocess.PIPE)
    line = my_tool_subprocess.stdout.readline()
    while line:
        # here I parse stdout..
        line = my_tool_subprocess.stdout.readline()  # returns b'' at EOF, ending the loop
    my_tool_subprocess.wait()


num = None  # set to the number of workers you want (it defaults to the cpu count of your machine)
tp = ThreadPool(num)
for sample in all_samples:
    tp.apply_async(work, (sample,))

tp.close()
tp.join()
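Since the outputs also need to be matched back to their samples, one possibility is to have `work` return a `(sample, parsed_output)` pair and collect the `AsyncResult` objects from `apply_async`. The sketch below assumes this variant; `echo` stands in for `mytool` and the list-of-lines "parsing" stands in for the real parsing, just to keep it runnable:

```python
from multiprocessing.pool import ThreadPool
import subprocess


def work(sample):
    # 'echo {}' is a stand-in for 'mytool {}' so this sketch is runnable.
    proc = subprocess.Popen('echo {}'.format(sample),
                            shell=True, stdout=subprocess.PIPE)
    parsed = [line for line in proc.stdout]  # replace with your real parsing
    proc.wait()
    return sample, parsed


all_samples = ['a', 'b', 'c']  # placeholder inputs
tp = ThreadPool(4)
results = [tp.apply_async(work, (s,)) for s in all_samples]
tp.close()
tp.join()

# AsyncResult.get() returns the (sample, parsed) pair, so each output
# stays paired with its input regardless of which job finished first.
output_by_sample = dict(r.get() for r in results)
```

Each `AsyncResult.get()` call returns after its job has finished, so `output_by_sample` maps every input sample to its parsed output even though the jobs complete in an arbitrary order.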
