Why does "multiprocessing.Pool" run endlessly on Windows?


Problem description

I have defined the function get_content to crawl data from https://www.investopedia.com/. I tried get_content('https://www.investopedia.com/terms/1/0x-protocol.asp') and it worked. However, the process seems to run infinitely on my Windows laptop. I checked that it runs well on Google Colab and Linux laptops.

Could you please elaborate why my function does not work in this parallel setting?

import requests
from bs4 import BeautifulSoup
from multiprocessing import dummy, freeze_support, Pool
import os

core = os.cpu_count()  # number of logical processors, for parallel computing
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
session = requests.Session()
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp']

############ Get content of a word
def get_content(l):
    r = session.get(l, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    print(entry_name)

############ Parallel computing
if __name__ == "__main__":
    freeze_support()
    P_d = dummy.Pool(processes=core)  # thread-based pool (multiprocessing.dummy)
    P = Pool(processes=core)          # process-based pool
    #content_list = P_d.map(get_content, links)
    content_list = P.map(get_content, links)

Update 1: I ran this code in JupyterLab from the Anaconda distribution; the kernel status stays busy the whole time.

Update 2: In Spyder the code finishes executing, but it still produces no output.

Update 3: The code runs perfectly fine in Colab.

Recommended answer

Quite a bit to unpack here, but it basically all boils down to how python spins up a new process, and executes the function you want.

On *nix systems, the default way to create a new process is by using fork. This is great because it uses "copy-on-write" to give the new child process access to a copy of the parent's working memory. It is fast and efficient, but it comes with a significant drawback if you're using multithreading at the same time. Not everything actually gets copied, and some things can get copied in an invalid state (threads, mutexes, file handles etc). This can cause quite a number of problems if not handled correctly, and to get around those python can use spawn instead (also Windows doesn't have "fork" and must use "spawn").
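
For illustration, here is a minimal sketch (not from the original post) of requesting a start method explicitly; on Windows only "spawn" exists, while on Linux you can pass "fork" instead to compare the two. Save it as a .py file and run it from a terminal:

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # "spawn" is the only start method available on Windows;
    # on Linux the default is "fork", but "spawn" can be requested too.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # -> [1, 4, 9]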

Spawn basically starts a new interpreter from scratch, and does not copy the parent's memory in any way. Some mechanism must be used to give the child access to functions and data defined before it was created however, and python does this by having that new process basically import * from the ".py" file it was created from. This is problematic with interactive mode because there isn't really a ".py" file to import, and is the primary source of "multiprocessing doesn't like interactive" problems. Putting your mp code into a library which you then import and execute does work in interactive, because it can be imported from a ".py" file. This is also why we use the if __name__ == "__main__": line to separate any code you don't want to be re-executed in the child when the import occurs. If you were to spawn a new process without this, it could recursively keep spawning children (though there's technically a built-in guard for that specific case iirc).
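
As a hedged sketch of the "put your mp code into a library" approach described above (the file names here are illustrative, not from the original post):

# --- worker.py: an importable module, so "spawn" children can find the function
import requests

def get_status(url):
    return requests.get(url).status_code

# --- main.py: run with "python main.py" from a terminal, not interactively
from multiprocessing import Pool
from worker import get_status

if __name__ == "__main__":  # keeps the pool from re-running when the child imports this file
    with Pool(processes=2) as pool:
        print(pool.map(get_status, ['https://www.investopedia.com/']))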

Then with either start method, the parent communicates with the child over a pipe (using pickle to exchange python objects) telling it what function to call, and what the arguments are. This is why arguments must be picklable. Some things can't be pickled, which is another common source of errors in multiprocessing.
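
A quick way to check whether something will survive that pipe is to pickle it yourself. This small illustration (not from the original post) shows why a module-level function works while a lambda does not:

import pickle

def double(x):
    return 2 * x

pickle.dumps(double)  # fine: pickled by reference to its module-level name
try:
    pickle.dumps(lambda x: 2 * x)  # lambdas have no importable name
except Exception as e:
    print(type(e).__name__, e)  # raises a pickling error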

Finally on another note, the IPython interpreter (the default Spyder shell) doesn't always collect stdout or stderr from child processes when using "spawn", meaning print statements won't be shown. The vanilla (python.exe) interpreter handles this better.

In your specific case:

  • Jupyter Lab runs in interactive mode, so the child process is created but hits an error along the lines of "can't import get_content from __main__". The error isn't displayed correctly because it happened in the child rather than the main process, and Jupyter doesn't relay the child's stderr properly.
  • Spyder uses IPython, which by default does not relay print statements from the child to the parent. You can switch to the "external system console" in the Run dialog, but you must then also do something to keep the window open long enough to read the output (i.e., prevent the process from exiting).
  • Google Colab executes your code on a Google server running Linux rather than locally on your Windows machine, so "fork" is used as the start method and the particular problem of having no ".py" file to import from never arises.
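
Putting this together, here is a sketch (under the assumptions above, not the answerer's verbatim fix) of how the original script could be restructured so it works on Windows: save it as a plain .py file, keep the pool behind the __main__ guard, return values instead of printing them, and launch it from a terminal with python fixed_scrape.py:

# fixed_scrape.py -- run from a terminal: python fixed_scrape.py
import os
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool, freeze_support

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

def get_content(l):
    # Each "spawn"ed child re-imports this module, so per-process state
    # is created here rather than shared with the parent.
    r = requests.get(l, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    return str(entry_name)  # return a plain str so it pickles back cleanly

if __name__ == "__main__":
    freeze_support()
    links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
             'https://www.investopedia.com/terms/1/1-10net30.asp']
    with Pool(processes=os.cpu_count()) as pool:
        content_list = pool.map(get_content, links)
    print(content_list)  # printed in the parent, so it shows up everywhere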

