Script throws an error when it is made to run using multiprocessing


Question


I've written a script in Python in combination with BeautifulSoup to extract the titles of books that get populated upon providing some ISBN numbers in the Amazon search box. I'm providing those ISBN numbers from an excel file named amazon.xlsx. When I try my following script, it parses the titles accordingly and writes them back to the excel file as intended.

This is the link in which I placed the ISBN numbers to populate the results.

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook

wb = load_workbook('amazon.xlsx')
ws = wb['content']

def get_info(num):
    params = {
        'url': 'search-alias=aps',
        'field-keywords': num
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?",params=params)
    soup = BeautifulSoup(res.text,"lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        get_data(itemlink['href'])

def get_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        itmtitle = "N/A"

    print(itmtitle)

    ws.cell(row=row, column=2).value = itmtitle
    wb.save("amazon.xlsx")

if __name__ == '__main__':
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        val = ws["A" + str(row)].value
        get_info(val)


However, when I try to do the same using multiprocessing I get the following error:

ws.cell(row=row, column=2).value = itmtitle
NameError: name 'row' is not defined


For multiprocessing, the changes I made in my script are:

from multiprocessing import Pool

if __name__ == '__main__':
    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row,column=1).value==None:break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    with Pool(10) as p:
        p.map(get_info,isbnlist)
        p.terminate()
        p.join()


A few of the ISBNs I've tried with:

9781584806844
9780917360664
9780134715308
9781285858265
9780986615108
9780393646399
9780134612966
9781285857589
9781453385982
9780134683461


How can I get rid of that error and get the desired results using multiprocessing?

Answer


It does not make sense to reference the global variable row in get_data(), because


  1. It's a global and will not be shared between each "thread" in the multiprocessing Pool, because they are actually separate python processes that do not share globals.


Even if they did, because you're building the entire ISBN list before executing get_info(), the value of row will always be ws.max_row + 1 because the loop has completed.


So you would need to provide the row values as part of the data passed to the second argument of p.map(). But even if you were to do that, writing to and saving the spreadsheet from multiple processes is a bad idea due to Windows file locking, race conditions, etc. You're better off just building the list of titles with multiprocessing, and then writing them out once when that's done, as in the following:

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook
from multiprocessing import Pool


def get_info(isbn):
    params = {
        'url': 'search-alias=aps',
        'field-keywords': isbn
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
    soup = BeautifulSoup(res.text, "lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        return get_data(itemlink['href'])


def get_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        itmtitle = "N/A"

    return itmtitle


def main():
    wb = load_workbook('amazon.xlsx')
    ws = wb['content']

    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    with Pool(10) as p:
        titles = p.map(get_info, isbnlist)
        p.terminate()
        p.join()

    # p.map preserves input order, so the titles line up with the ISBN rows;
    # iterating over titles also avoids an IndexError if the ISBN loop broke early.
    for i, title in enumerate(titles):
        ws.cell(row=i + 2, column=2).value = title

    wb.save("amazon.xlsx")


if __name__ == '__main__':
    main()
