subprocess + multiprocessing - multiple commands in sequence


Problem description


I have a set of command line tools that I'd like to run in parallel on a series of files. I've written a python function to wrap them that looks something like this:

import os
import shlex
import subprocess

def process_file(fn):
    print os.getpid()
    cmd1 = "echo "+fn
    p = subprocess.Popen(shlex.split(cmd1))

    # after cmd1 finishes
    other_python_function_to_do_something_to_file(fn)

    cmd2 = "echo "+fn
    p = subprocess.Popen(shlex.split(cmd2))
    print "finish"

if __name__=="__main__":
    import multiprocessing
    p = multiprocessing.Pool()
    for fn in files:
        RETURN = p.apply_async(process_file,args=(fn,),kwds={some_kwds})

While this works, it does not seem to be running multiple processes; it seems like it's just running in serial (I've tried using Pool(5) with the same result). What am I missing? Are the calls to Popen "blocking"?

EDIT: Clarified a little. I need cmd1, then some python command, then cmd2, to execute in sequence on each file.

EDIT2: The output from the above has the pattern:

pid
finish
pid
finish
pid
finish

whereas a similar call, using map in place of apply (but without any provision for passing kwds) looks more like

pid
pid
pid
finish
finish
finish

However, the map call sometimes (always?) hangs after apparently succeeding

Solution

Are the calls to Popen "blocking"?

No. Just creating a subprocess.Popen returns immediately, giving you an object that you could wait on or otherwise use. If you want to block, that's simple:

subprocess.check_call(shlex.split(cmd1))
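If you do want to keep Popen, a minimal sketch (reusing cmd1 from above) of waiting on the object it returns, so the child has actually exited before you move on:

p = subprocess.Popen(shlex.split(cmd1))   # returns immediately
# ... other work could happen here while cmd1 runs ...
p.wait()                                  # block until cmd1 exits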

Meanwhile, I'm not sure why you're putting your args together into a string and then trying to shlex them back to a list. Why not just write the list?

cmd1 = ["echo", fn]
subprocess.check_call(cmd1)

While this works, it does not seem to be running multiple processes; it seems like it's just running in serial

What makes you think this? Given that each process just kicks off two processes into the background as fast as possible, it's going to be pretty hard to tell whether they're running in parallel.

If you want to verify that the work really is being spread across multiple processes, you may want to add some prints or logging (and throw something like os.getpid() into the messages).
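For example, a sketch of the kind of instrumentation that makes the overlap visible; the timestamps are illustrative additions, not part of the original code:

import os
import time

def process_file(fn):
    print("[pid %d] %.2f start  %s" % (os.getpid(), time.time(), fn))
    # ... cmd1, the Python step, cmd2 ...
    print("[pid %d] %.2f finish %s" % (os.getpid(), time.time(), fn))

If the lines from different pids interleave, the pool really is running the workers in parallel.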

Meanwhile, it looks like you're trying to exactly duplicate the effect of multiprocessing.Pool.map_async with a loop around multiprocessing.Pool.apply_async, except that instead of accumulating the results you're stashing each one in a variable called RETURN and then throwing it away before you can use it. Why not just use map_async?
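A sketch of the map_async version; map_async has no kwds parameter, so functools.partial is one way to bake the keyword arguments in (some_kwds and files are the question's placeholders, assumed here to be a dict and a list of filenames):

import functools
import multiprocessing

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    worker = functools.partial(process_file, **some_kwds)  # bind the keyword arguments
    async_result = pool.map_async(worker, files)
    pool.close()
    pool.join()                   # wait until every file has been processed
    results = async_result.get()  # one return value per file, in order

process_file still has to be a module-level function so the pool can pickle it for the workers.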

Finally, you asked whether multiprocessing is the right tool for the job. Well, you clearly need something asynchronous: check_call(args(file1)) has to block other_python_function_to_do_something_to_file(file1), but at the same time not block check_call(args(file2)).
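Putting those pieces together, one sketch of a process_file that blocks at each step, so cmd1, the Python function, and cmd2 run strictly in sequence for each file while the pool keeps several files in flight at once:

import subprocess

def process_file(fn):
    subprocess.check_call(["echo", fn])                 # cmd1: returns only after it exits
    other_python_function_to_do_something_to_file(fn)   # the in-between Python step
    subprocess.check_call(["echo", fn])                 # cmd2: starts only after the step above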

I would probably have used threading, but really, it doesn't make much difference. Even if you're on a platform where process startup is expensive, you're already paying that cost, because the whole point is running N * M child processes, so another pool of 8 isn't going to hurt anything. And there's little risk of either accidentally creating races by sharing data between threads, or accidentally writing code that looks like it shares data between processes but doesn't, since there's nothing to share. So, whichever one you like more, go for it.
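For reference, a sketch of the thread-based variant; multiprocessing.dummy exposes the same Pool interface backed by threads, so it is close to a drop-in swap:

from multiprocessing.dummy import Pool as ThreadPool

if __name__ == "__main__":
    pool = ThreadPool(8)              # 8 worker threads instead of 8 extra processes
    pool.map(process_file, files)     # same per-file pipeline as above
    pool.close()
    pool.join()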

The other alternative would be to write an event loop. Which I might actually start doing myself for this problem, but I'd regret it, and you shouldn't do it…
