Multiprocessing slower than serial processing in Windows (but not in Linux)


Problem Description

I'm trying to parallelize a for loop to speed up my code, since the loop processing operations are all independent. Following online tutorials, it seems the standard multiprocessing library in Python is a good start, and I've got this working for basic examples.

However, for my actual use case, I find that parallel processing (using a dual core machine) is actually a little (<5%) slower, when run on Windows. Running the same code on Linux, however, results in a parallel processing speed-up of ~25%, compared to serial execution.

From the docs, I believe this may relate to Windows' lack of a fork() function, which means the process needs to be initialised fresh each time. However, I don't fully understand this and wonder if anyone can confirm it?

Particularly,

--> Does this mean that all code in the calling Python file gets run for each parallel process on Windows, even initialising classes and importing packages?

--> If so, can this be avoided by somehow passing a copy (e.g. using deepcopy) of the class into the new processes?

--> Are there any tips / other strategies for efficient parallelisation of code design for both Unix and Windows?

My exact code is long and uses many files, so I have created a pseudocode-style example structure which hopefully shows the issue.

# Imports
import multiprocessing
import numpy as np
from my_package import MyClass
# ... many other package / function imports

# Initialization (instantiate class and call slow functions that get it ready for processing)
my_class = MyClass()
my_class.set_up(input1=1, input2=2)

# Define main processing function to be used in loop
def calculation(_input_data):
    # Perform some functions on _input_data
    # ......
    # Call method of the instantiated class to act on the data
    return my_class.class_func(_input_data)

input_data = np.linspace(0, 1, 50)
output_data = np.zeros_like(input_data)

# For Loop (SERIAL implementation)
for i, x in enumerate(input_data):
    output_data[i] = calculation(x)

# PARALLEL implementation (this doesn't work well!)
with multiprocessing.Pool(processes=4) as pool:
    results = pool.map_async(calculation, input_data)
    results.wait()
output_data = results.get()

EDIT: I do not believe the question is a duplicate of the one suggested, since this relates to a difference between Windows and Linux, which is not mentioned at all in the suggested duplicate question.

Solution

NT operating systems lack the UNIX fork primitive. When a new process is created, it starts as a blank process. It is the parent's responsibility to instruct the new process on how to bootstrap itself.

The Python multiprocessing API abstracts process creation, trying to give the same feel to the fork, forkserver and spawn start methods.
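
As a minimal sketch (not part of the original answer), the start method can be selected explicitly via get_context. Forcing spawn on Linux reproduces the Windows behaviour and its extra start-up cost, which makes the overhead easy to compare directly:

import multiprocessing

def work(x):
    return x * x

if __name__ == "__main__":
    # "spawn" is the default on Windows; "fork" is the default on Linux.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(work, range(8)))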

When you use the spawn start method, this is what happens under the hood (a minimal sketch follows the list below).

  1. A blank process is created
  2. The blank process starts a brand new Python interpreter
  3. The Python interpreter is given the MFA (Module Function Arguments) you specified via the Process class initializer
  4. The Python interpreter loads the given module resolving all the imports
  5. The target function is looked up within the module and called with the given args and kwargs
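
A minimal sketch of that flow (the names here are illustrative, not from the question): the Process initializer records the module, target function and arguments, and the spawned child re-imports the module before looking the target up and calling it, which is why the entry point must sit behind the __main__ guard:

import multiprocessing

def greet(name, punctuation="!"):        # the function and arguments of the MFA
    print("Hello " + name + punctuation)

if __name__ == "__main__":               # guard: the child re-imports this module
    multiprocessing.set_start_method("spawn")
    p = multiprocessing.Process(target=greet, args=("world",), kwargs={"punctuation": "?"})
    p.start()
    p.join()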

The above flow brings a few implications.

As you noticed yourself, it is a much more taxing operation compared to fork. That's why you notice such a difference in performance.

As the module gets imported from scratch in the child process, all import side effects are executed anew. This means that constants, global variables, decorators and top-level instructions will be executed again.
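
A small single-file sketch (not from the original post) that makes this visible: under spawn the top-level print runs once in the parent and once more in every worker, whereas under fork it runs only once:

import multiprocessing

print("top-level code executed")      # side effect: re-runs in every spawned child

def task(x):
    return x * 2

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    with multiprocessing.Pool(processes=2) as pool:
        pool.map(task, range(4))
    # Under spawn the message is printed three times (parent + two workers);
    # under fork it would be printed only once.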

On the other hand, initializations made during the parent process execution will not be propagated to the child. See this example.
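
A short sketch of that point (illustrative names, assuming the spawn start method): state mutated in the parent after import is not visible in the child, because the child re-imports the module and only sees the module-level value:

import multiprocessing

my_state = {"ready": False}           # module-level state, re-created in each spawned child

def check():
    return my_state["ready"]

if __name__ == "__main__":
    my_state["ready"] = True          # initialization done only in the parent
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=1) as pool:
        print(pool.apply(check))      # False under spawn, True under fork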

This is why the multiprocessing documentation adds a specific paragraph for Windows in the Programming Guidelines. I highly recommend reading the Programming Guidelines, as they already include all the information required to write portable multiprocessing code.
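
One portable pattern worth knowing (a sketch built around the hypothetical MyClass from the question, not something prescribed by the answer above) is to perform the expensive set-up once per worker via the Pool initializer, so the same code behaves reasonably under both fork and spawn:

import multiprocessing
from my_package import MyClass        # hypothetical package from the question

my_class = None                       # populated per worker by init_worker

def init_worker():
    global my_class
    my_class = MyClass()
    my_class.set_up(input1=1, input2=2)   # expensive set-up runs once per worker, not per task

def calculation(x):
    return my_class.class_func(x)

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4, initializer=init_worker) as pool:
        output_data = pool.map(calculation, range(50))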
