Script throws an error when it is made to run using multiprocessing
Question
I've written a script in Python with BeautifulSoup to extract the titles of books that come up when some ISBN numbers are entered in the Amazon search box. I'm supplying those ISBN numbers from an Excel file named amazon.xlsx. When I run the following script, it parses the titles and writes them back to the Excel file as intended.
import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook

wb = load_workbook('amazon.xlsx')
ws = wb['content']

def get_info(num):
    params = {
        'url': 'search-alias=aps',
        'field-keywords': num
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?",params=params)
    soup = BeautifulSoup(res.text,"lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        get_data(itemlink['href'])

def get_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError: itmtitle = "N\A"

    print(itmtitle)

    ws.cell(row=row, column=2).value = itmtitle
    wb.save("amazon.xlsx")

if __name__ == '__main__':
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row,column=1).value==None:break
        val = ws["A" + str(row)].value
        get_info(val)
However, when I try to do the same using multiprocessing, I get the following error:
ws.cell(row=row, column=2).value = itmtitle
NameError: name 'row' is not defined
For multiprocessing, the changes I made to the script are:
from multiprocessing import Pool

if __name__ == '__main__':
    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row,column=1).value==None:break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    with Pool(10) as p:
        p.map(get_info,isbnlist)
        p.terminate()
        p.join()
A few of the ISBNs I've tried:
9781584806844
9780917360664
9780134715308
9781285858265
9780986615108
9780393646399
9780134612966
9781285857589
9781453385982
9780134683461
How can I get rid of that error and get the desired results using multiprocessing?
Recommended answer
It does not make sense to reference the global variable row in get_data(), because:
- It's a global, and globals are not shared between the "threads" in a multiprocessing Pool: the workers are actually separate Python processes. If the pool uses the spawn start method (the default on Windows), each worker re-imports the module without running the if __name__ == '__main__': block, so row is never assigned there, hence the NameError.
- Even if it were shared, you build the entire ISBN list before get_info() is ever executed, so the loop has already finished and row would hold only its final value (ws.max_row, or the row of the first empty cell if the loop broke early) for every ISBN.
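The second point can be seen without any multiprocessing at all. The following is a minimal single-process sketch, using the built-in map() as a stand-in for p.map(); the function name row_seen_by_worker is made up for illustration, and the ISBNs are three of those listed in the question.

```python
# Three of the ISBNs from the question, standing in for column A of the sheet.
isbns = ["9781584806844", "9780917360664", "9780134715308"]

# Build the whole list first, exactly as the question's __main__ block does.
collected = []
for row in range(2, 2 + len(isbns)):
    collected.append(isbns[row - 2])

def row_seen_by_worker(isbn):
    # Hypothetical stand-in for get_data(): it reads the module-level loop
    # variable instead of receiving the row as an argument.
    return row

# The loop is already finished when the function runs, so every call sees
# the same, final value of row.
seen = list(map(row_seen_by_worker, collected))
print(seen)  # [4, 4, 4]
```

Every title would therefore be written to the same cell, even in a setup where the workers could see the variable.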
So you would need to pass the row number along with each ISBN in the iterable given as the second argument of p.map(). Even then, writing to and saving the spreadsheet from multiple processes is a bad idea because of Windows file locking, race conditions, and so on. You're better off using multiprocessing only to build the list of titles, then writing them out once when that's done, as in the following:
import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook
from multiprocessing import Pool

def get_info(isbn):
    # Search Amazon for the ISBN and follow the first result, if there is one.
    params = {
        'url': 'search-alias=aps',
        'field-keywords': isbn
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
    soup = BeautifulSoup(res.text, "lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        return get_data(itemlink['href'])
    return None  # no search result: the cell is left empty

def get_data(link):
    # Return the product title from the detail page, or "N/A" if it is missing.
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        itmtitle = "N/A"

    return itmtitle

def main():
    wb = load_workbook('amazon.xlsx')
    ws = wb['content']

    # Collect the ISBNs from column A, stopping at the first empty cell.
    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    # Workers only fetch titles; map() returns them in the order of isbnlist.
    # Leaving the with block shuts the pool down, so no explicit
    # terminate()/join() is needed.
    with Pool(10) as p:
        titles = p.map(get_info, isbnlist)

    # Write from the parent process only, one row per fetched title,
    # and save the workbook once.
    for row, title in enumerate(titles, start=2):
        ws.cell(row=row, column=2).value = title

    wb.save("amazon.xlsx")

if __name__ == '__main__':
    main()
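The alternative mentioned above, passing the row number along with each ISBN, might look roughly like the sketch below. It is not part of the original answer: the helper names pair_rows_with_isbns, lookup_title and fetch_all are hypothetical, and the lookup is a placeholder that builds a dummy title where the real script would call get_info(isbn).

```python
from multiprocessing import Pool


def pair_rows_with_isbns(column_values, first_row=2):
    """Pair each ISBN with its worksheet row, stopping at the first empty cell."""
    pairs = []
    for offset, value in enumerate(column_values):
        if value is None:
            break
        pairs.append((first_row + offset, value))
    return pairs


def lookup_title(pair):
    """Hypothetical worker: takes (row, isbn) and returns (row, title)."""
    row, isbn = pair
    # Placeholder for the real scraping call, e.g. get_info(isbn).
    title = "title for {}".format(isbn)
    return row, title


def fetch_all(pairs, workers=4):
    """Run the lookups in a process pool; results come back in input order."""
    with Pool(workers) as pool:
        return pool.map(lookup_title, pairs)
```

In the real script, fetch_all() would be called from under the if __name__ == '__main__': guard, and the parent process would then loop over the returned (row, title) pairs, set ws.cell(row=row, column=2).value = title for each, and save the workbook once. Because every result carries its own row, the write-back no longer depends on the order of the results.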