Why must we explicitly pass constants into multiprocessing functions?

Problem description

I have been working with the multiprocessing package to speed up some geoprocessing (GIS/arcpy) tasks that are redundant and need to be done the same way for more than 2,000 similar geometries.

The splitting up works well, but my "worker" function is rather long and complicated because the task itself from start to finish is complicated. I would love to break the steps down further, but I am having trouble passing information to/from the worker function because, for some reason, ANYTHING that a worker function under multiprocessing uses needs to be passed in explicitly.

This means I cannot define constants in the body of if __name__ == '__main__' and then use them in the worker function. It also means that my parameter list for the worker function is getting really long - which is super ugly, since trying to use more than one parameter also requires creating a helper "star" function and then using itertools to zip the arguments back up (à la the second answer on this question).

I have created a trivial example below that demonstrates what I am talking about. Are there any workarounds for this - a different approach I should be using - or can someone at least explain why this is the way it is?

Note: I am running this on Windows Server 2008 R2 Enterprise x64.

Edit: I seem to have not made my question clear enough. I am not that concerned with how pool.map only takes one argument (although it is annoying), but rather I do not understand why the scope of a function defined outside of if __name__ == '__main__' cannot access things defined inside that block when it is used as a multiprocessing function - unless you explicitly pass them in as arguments, which is obnoxious.

import os
import multiprocessing
import itertools

def loop_function(word):
    # Relies on root_dir defined in the __main__ block below; this works serially,
    # but in a spawned child process that block never runs, so root_dir is undefined there.
    file_name = os.path.join(root_dir, word + '.txt')
    with open(file_name, "w") as text_file:
        text_file.write(word + " food")

def nonloop_function(word, root_dir): # <------ PROBLEM
    file_name = os.path.join(root_dir, word + '.txt')
    with open(file_name, "w") as text_file:
        text_file.write(word + " food")

def nonloop_star(arg_package):
    return nonloop_function(*arg_package)

# Serial version
#
# if __name__ == '__main__':
#     root_dir = 'C:\\hbrowning'
#     word_list = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']
#     for word in word_list:
#         loop_function(word)
#
## --------------------------------------------

# Multiprocessing version
if __name__ == '__main__':
    root_dir = 'C:\\hbrowning'
    word_list = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']
    NUM_CORES = 2
    pool = multiprocessing.Pool(NUM_CORES, maxtasksperchild=1)

    results = pool.map(nonloop_star,
                       itertools.izip(word_list, itertools.repeat(root_dir)),  # izip is Python 2; use built-in zip on Python 3
                       chunksize=1)
    pool.close()
    pool.join()

Recommended answer

The problem is, at least on Windows (although there are similar caveats with the *nix fork style of multiprocessing, too), that when you execute your script, it (to greatly simplify things) effectively ends up as if you had called two blank (shell) processes with subprocess.Popen() and then had them execute:

python -c "from your_script import nonloop_star; nonloop_star(('dog', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('cat', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('llama', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('yeti', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('parakeet', 'C:\\hbrowning'))"
python -c "from your_script import nonloop_star; nonloop_star(('dolphin', 'C:\\hbrowning'))"

one by one, as soon as one of those processes finishes with the previous call. That means that your if __name__ == "__main__" block never gets executed in the spawned processes (there your script is not the main script; it is imported as a module), so anything declared within it is not readily available to the worker function (it was never evaluated).

For the stuff outside your function you can at least cheat by accessing your module via sys.modules["your_script"] or even with globals(), but that works only for what has been evaluated at import time, so anything placed inside the if __name__ == "__main__" guard is not available, as it never even had a chance to run. That's also the reason why you must use this guard on Windows - without it you'd be executing your pool creation, and the other code you nested within the guard, over and over again with each spawned process.
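To make the re-import behaviour visible, here is a minimal sketch of my own (not from the original post). On Windows the module-level print fires once in the parent and once in each spawned worker process, while the guarded print fires only in the parent:

import os
import multiprocessing

# Module-level code: re-executed whenever the script is re-imported by a child process.
print("module level, pid %d" % os.getpid())

def worker(x):
    return x * x

if __name__ == '__main__':
    # Guarded code: runs only in the process that was started as the main script.
    print("__main__ guard, pid %d" % os.getpid())
    pool = multiprocessing.Pool(2)
    print(pool.map(worker, range(4)))
    pool.close()
    pool.join()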

If you need to share read-only data in your multiprocessing functions, just define it in the global namespace of your script, outside of that __main__ guard, and all functions will have access to it (as it gets re-evaluated when starting a new process), regardless of whether they are running as separate processes or not.
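Applied to the example from the question, a sketch (assuming the C:\hbrowning directory exists) where the root directory lives at module level, so loop_function can use it directly, with no extra parameter and no "star" helper:

import os
import multiprocessing

ROOT_DIR = 'C:\\hbrowning'  # module level: re-evaluated when each child re-imports the script

def loop_function(word):
    # ROOT_DIR is visible here in every process without being passed as an argument
    file_name = os.path.join(ROOT_DIR, word + '.txt')
    with open(file_name, "w") as text_file:
        text_file.write(word + " food")

if __name__ == '__main__':
    word_list = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']
    pool = multiprocessing.Pool(2, maxtasksperchild=1)
    pool.map(loop_function, word_list, chunksize=1)
    pool.close()
    pool.join()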

If you need data that changes, then you need to use something that can synchronize itself over different processes - there is a slew of modules designed for that, but most of the time Python's own pickle-based, message-passing multiprocessing.Manager (and the types it provides), albeit slow and not very flexible, is enough.
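As an illustration (my own sketch, not part of the original answer), a manager.dict() proxy can be handed to the workers through pool.map and mutated from each process:

import multiprocessing

def worker(args):
    word, shared = args          # unpack the (word, proxy) pair passed through pool.map
    shared[word] = len(word)     # writes go through the manager process and are visible to all

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared = manager.dict()      # a proxy object that synchronizes across processes
    words = ['dog', 'cat', 'llama', 'yeti', 'parakeet', 'dolphin']
    pool = multiprocessing.Pool(2)
    pool.map(worker, [(w, shared) for w in words])
    pool.close()
    pool.join()
    print(dict(shared))          # e.g. {'dog': 3, 'cat': 3, 'llama': 5, ...}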
