Why does "multiprocessing.Pool" run endlessly on Windows?


Problem description

I have defined the function get_content to crawl data from https://www.investopedia.com/. I tried get_content('https://www.investopedia.com/terms/1/0x-protocol.asp') and it worked. However, the process seems to run infinitely on my Windows laptop. I checked that it runs well on Google Colab and Linux laptops.

Could you please explain why my function does not work in this parallel setting?

import os
import requests
from bs4 import BeautifulSoup
from multiprocessing import dummy, freeze_support, Pool

core = os.cpu_count()  # number of logical processors for parallel computing
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
session = requests.Session()
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp', 'https://www.investopedia.com/terms/1/1-10net30.asp']

############ Get content of a word
def get_content(l):
    r = session.get(l, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    print(entry_name)

############ Parallel computing
if __name__ == "__main__":
    freeze_support()
    P_d = dummy.Pool(processes=core)  # thread pool (multiprocessing.dummy)
    P = Pool(processes=core)          # process pool
    #content_list = P_d.map(get_content, links)
    content_list = P.map(get_content, links)

Update1: I ran this code in JupyterLab from the Anaconda distribution; the kernel status stays busy the whole time.

Update2: The code finishes executing in Spyder, but it still produces no output.

Update3: The code runs perfectly fine in Colab.

Recommended answer

Quite a bit to unpack here, but it basically all boils down to how Python spins up a new process and executes the function you want.

On *nix systems, the default way to create a new process is by using fork. This is great because it uses "copy-on-write" to give the new child process access to a copy of the parent's working memory. It is fast and efficient, but it comes with a significant drawback if you're using multithreading at the same time. Not everything actually gets copied, and some things can get copied in an invalid state (threads, mutexes, file handles, etc.). This can cause quite a number of problems if not handled correctly, and to get around those Python can use spawn instead (also Windows doesn't have "fork" and must use "spawn").
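For illustration, the start method can be selected explicitly through multiprocessing.get_context. The snippet below is a minimal sketch (the worker function square is made up for the example), and with "spawn" it has to be run as a script rather than pasted into an interactive session:

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # "spawn" is the only start method available on Windows; "fork" remains the
    # default on Linux, while macOS has defaulted to "spawn" since Python 3.8.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]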

Spawn basically starts a new interpreter from scratch, and does not copy the parent's memory in any way. Some mechanism must be used to give the child access to functions and data defined before it was created, however, and Python does this by having that new process basically import * from the ".py" file it was created from. This is problematic with interactive mode because there isn't really a ".py" file to import, and is the primary source of "multiprocessing doesn't like interactive" problems. Putting your mp code into a library which you then import and execute does work interactively, because it can be imported from a ".py" file. This is also why we use the if __name__ == "__main__": line to separate any code you don't want to be re-executed in the child when the import occurs. If you were to spawn a new process without this, it could recursively keep spawning children (though there's technically a built-in guard for that specific case iirc).
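As a small sketch of what that implies in practice (file and function names are hypothetical): anything at module level runs again in every spawned child when it re-imports the file, while the guarded block runs only in the parent.

# guard_demo.py -- run with "python guard_demo.py", not interactively
from multiprocessing import Pool
import os

print(f"module imported in process {os.getpid()}")  # under "spawn" this also prints once per child

def work(x):
    return x * 2  # defined at module level so the child can import it by name

if __name__ == "__main__":
    # only the parent executes this block; children stop after the import above
    with Pool(processes=2) as pool:
        print(pool.map(work, [1, 2, 3]))  # [2, 4, 6]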

Then with either start method, the parent communicates with the child over a pipe (using pickle to exchange python objects) telling it what function to call, and what the arguments are. This is why arguments must be picklable. Some things can't be pickled, which is another common source of errors in multiprocessing.
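For example (a hedged, self-contained snippet), a lambda or a nested function can't be pickled, so it can't be shipped to a worker; the serialization step Pool performs can be reproduced directly with pickle:

import pickle

# Pool pickles the target callable and its arguments before sending them to a worker.
try:
    pickle.dumps(lambda x: x + 1)  # lambdas have no importable name, so pickling fails
except pickle.PicklingError as exc:
    print("not picklable:", exc)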

Finally on another note, the IPython interpreter (the default Spyder shell) doesn't always collect stdout or stderr from child processes when using "spawn", meaning print statements won't be shown. The vanilla (python.exe) interpreter handles this better.
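One workaround (a sketch of my own, not the only option) is to have the worker return its result instead of printing it, so the parent, whose stdout is always visible, does the printing:

# return_demo.py -- hypothetical sketch; the worker returns instead of printing
from multiprocessing import Pool

def get_title(url):
    # ...fetch and parse here...
    return f"title of {url}"

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        titles = pool.map(get_title, ['https://example.com/a', 'https://example.com/b'])
    for title in titles:
        print(title)  # printed in the parent, so it shows up even in an IPython console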

In your particular case:

  • Jupyter Lab is running in interactive mode, and the child process will have been created but will have hit an error something like "can't import get_content from __main__". The error doesn't get displayed correctly because it didn't happen in the main process, and Jupyter doesn't handle stderr from the child correctly.
  • Spyder is using IPython, which by default does not relay the print statements from the child to the parent. Here you can switch to the "external system console" in the "run" dialog, but then you must also do something to keep the window open long enough to read the output (prevent the process from exiting). A standalone-script rewrite is sketched after this list.
  • Google Colab is using a Google server running Linux to execute your code rather than executing it locally on your Windows machine, so with "fork" as the start method, not having a ".py" file to import from is simply not an issue.
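
Putting those pieces together, a standalone rewrite of the original script might look roughly like the sketch below; run it with "python crawl.py" from a plain console. The return-instead-of-print change and the per-process session comment are my additions, not part of the original question.

# crawl.py -- hedged sketch of a spawn-friendly version of the question's code
import os
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool, freeze_support

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp', 'https://www.investopedia.com/terms/1/1-10net30.asp']
session = requests.Session()  # re-created in each child when it re-imports this module

def get_content(l):
    r = session.get(l, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup.select_one('#article-heading_3-0').contents[0]  # return instead of print

if __name__ == "__main__":
    freeze_support()  # only matters for frozen executables (e.g. PyInstaller)
    with Pool(processes=os.cpu_count()) as pool:
        content_list = pool.map(get_content, links)
    for entry in content_list:
        print(entry)  # printed by the parent, visible in any console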

