Multiprocessing for WebScraping won't start on Windows and Mac


Problem description


I asked a question here about multiprocessing a few days ago, and one user sent me the answer you can see below. The only problem is that the answer works on his machine and does not work on mine.


I have tried on Windows (Python 3.6) and on Mac (Python 3.8). I have run the code in the basic Python IDLE that came with the installation, in PyCharm on Windows, and in a Jupyter Notebook, and nothing happens. I have 32-bit Python. This is the code:

from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta
from multiprocessing import Pool
import tqdm

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

def parse(url):
    print("im in function")

    response = requests.get(url[4], headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    all_skier_names = soup.find_all("div", class_ = "g-xs-10 g-sm-9 g-md-4 g-lg-4 justify-left bold align-xs-top")
    all_countries = soup.find_all("span", class_ = "country__name-short")

    discipline = url[0]
    season = url[1]
    competition = url[2]
    gender = url[3]

    out = []
    for name, country in zip(all_skier_names , all_countries):
        skier_name = name.text.strip().title()
        country = country.text.strip()
        out.append([discipline, season,  competition,  gender,  country,  skier_name])

    return out

all_urls = [['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode=']]

with Pool(processes=2) as pool, tqdm.tqdm(total=len(all_urls)) as pbar:
    all_data = []
    print("im in pool")

    for data in pool.imap_unordered(parse, all_urls):
        print("im in data")

        all_data.extend(data)
        pbar.update()

print(all_data) 


The only thing that I see when I run the code is the progress bar, which is always at 0%:

  0%|          | 0/8 [00:00<?, ?it/s]


I set a couple of print statements in the parse(url) function and in the for loop at the end of the code, but still, the only thing printed is "im in pool". It seems like the code does not enter the function at all, and it never reaches the for loop at the end of the code.


The code should execute in 5-8 seconds, but I have been waiting for 10 minutes and nothing is happening. I have also tried this without the progress bar, but the result is the same.


Do you know what the problem is? Is it the version of Python I am using (Python 3.6, 32-bit) or the version of some library? I don't know what to do.
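(Editor's note, not part of the original thread: a common cause of exactly this symptom is that on Windows, and on macOS with Python 3.8 where spawn became the default start method, multiprocessing re-imports the main module in every worker, so any Pool code at module level must sit behind an `if __name__ == "__main__":` guard. Running Pool code in IDLE or a Jupyter Notebook is also known to hang for the same reason. A minimal sketch of the guarded pattern, with a stand-in `parse` in place of the real scraper:)

```python
from multiprocessing import Pool

def parse(url):
    # stand-in for the real scraping work done in the question's parse()
    return (url, "parsed")

def main():
    # sorted() because imap_unordered yields results in completion order
    with Pool(processes=2) as pool:
        return sorted(pool.imap_unordered(parse, ["a", "b", "c"]))

if __name__ == "__main__":
    # The guard is essential: with the spawn start method, each worker
    # re-imports this module, and unguarded Pool code would be
    # re-executed in every child instead of running once in the parent.
    print(main())
```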

Recommended answer


A better choice for you would be multithreading, which Python implements with the threading module:

import logging
import threading

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    threads = list()

    # scraper_list, scraper_checker, error_file and success_file come from
    # the answerer's own project and are not defined in this snippet
    for scraper in scraper_list:
        logging.info("Main    : create and start thread %s.", scraper)
        x = threading.Thread(target=scraper_checker, args=(scraper,))
        threads.append(x)
        x.start()

    for index, thread in enumerate(threads):
        thread.join()
        logging.info("Main    : thread %d done", index)

    error_file.close()
    success_file.close()

    print("Done!")
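The snippet above references names from the answerer's own project (scraper_list, scraper_checker, error_file, success_file), so it does not run as-is. A self-contained sketch of the same start/join pattern, with a hypothetical worker in place of the real scraper:

```python
import threading

results = []
results_lock = threading.Lock()

def worker(item):
    # hypothetical stand-in for the real scraper function; the lock
    # guards the shared results list that all threads append to
    with results_lock:
        results.append(item.upper())

items = ["alpha", "beta", "gamma"]

# start one thread per work item, as in the answer's loop
threads = []
for item in items:
    t = threading.Thread(target=worker, args=(item,))
    threads.append(t)
    t.start()

# wait for all threads to finish before using the results
for t in threads:
    t.join()

print(sorted(results))
```

Unlike multiprocessing, threads run in the same interpreter process, so no `if __name__ == "__main__":` guard is needed and this works unchanged in IDLE and Jupyter.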

