How to condense script when scraping different locations for one element
Question
I have two working scripts that each do their job. I want to combine them for efficiency and to reduce redundancy. I am using Python 3.7, BeautifulSoup 4.7.1, re, and requests.
Script 1 searches 'li' and works with these test URLs:
https://www.amazon.com/dp/B00FSCBQV2
https://www.amazon.com/dp/B07L4YHBQ4
https://www.amazon.com/dp/B01N1ZD912
https://www.amazon.com/dp/B0040ODFK4
Script 2 searches 'tr' and works with these test URLs:
https://www.amazon.com/dp/B00Q2XLI0U
https://www.amazon.com/dp/B00CYVCWXG
I tried nesting them, in shorthand: try: script 1, else: try: script 2, else: pass. But it gets hairy and fails. I would like it in the try, except, pass format.
# Script 1
map_dict = {'Amazon Best Sellers Rank': ['R1_NO', 'R1_CAT']}
p = re.compile(r'#([0-9][0-9,]*)+[\n\s]+in[\n\s]+([A-Za-z&\s]+)')
fields = ['Amazon Best Sellers Rank']
final_dict = {}
# to handle null when writing to oracle later
final_dict['R1_NO'] = 'NA'
final_dict['R1_CAT'] = 'NA'
final_dict['R2_NO'] = 'NA'
final_dict['R2_CAT'] = 'NA'
final_dict['R3_NO'] = 'NA'
final_dict['R3_CAT'] = 'NA'
final_dict['R4_NO'] = 'NA'
final_dict['R4_CAT'] = 'NA'
for field in fields:
    element = soup.select_one('li:contains("' + field + '")')
    if element is None:
        item = dict(zip(map_dict[field], ['NA', 'NA']))
        final_dict = {**final_dict, **item}
    else:
        text = element.text
        i = 1
        for x, y in p.findall(text):
            prefix = 'R' + str(i) + '_'
            final_dict[prefix + 'NO'] = x
            final_dict[prefix + 'CAT'] = y.strip()
            i += 1
# Script 2
map_dict = {'Best Sellers Rank': ['R1_NO', 'R1_CAT']}
p = re.compile(r'#([0-9][0-9,]*)+[\n\s]+in[\n\s]+([A-Za-z&\s]+)')
fields = ['Best Sellers Rank']
final_dict = {}
# to handle null when writing to oracle later
final_dict['R1_NO'] = 'NA'
final_dict['R1_CAT'] = 'NA'
final_dict['R2_NO'] = 'NA'
final_dict['R2_CAT'] = 'NA'
final_dict['R3_NO'] = 'NA'
final_dict['R3_CAT'] = 'NA'
final_dict['R4_NO'] = 'NA'
final_dict['R4_CAT'] = 'NA'
for field in fields:
    element = soup.select_one('tr:contains("' + field + '")')
    if element is None:
        item = dict(zip(map_dict[field], ['NA', 'NA']))
        final_dict = {**final_dict, **item}
    else:
        text = element.text
        i = 1
        for x, y in p.findall(text):
            prefix = 'R' + str(i) + '_'
            final_dict[prefix + 'NO'] = x
            final_dict[prefix + 'CAT'] = y.strip()
            i += 1
I expect to have a combined DRY script that works on all the provided URLs. The script would look in 'li'; if the field is not there, it would look in 'tr'; and if it is not there either, the values would be assigned 'NA'. Again, the two scripts work separately.
Recommended answer
You could combine the two into one, since most of the code is the same. Simply make the field name the same across both: :contains will still match on the shortened field name Best Sellers Rank. Then use the CSS Or syntax (a comma-separated selector list) to handle tr versus li.
import requests
from bs4 import BeautifulSoup as bs
import re

links = ['https://www.amazon.com/dp/B00FSCBQV2', 'https://www.amazon.com/dp/B00Q2XLI0U']
map_dict = {'Product Dimensions': 'dimensions', 'Shipping Weight': 'weight', 'Item model number': 'Item_No', 'Best Sellers Rank': ['R1_NO', 'R1_CAT']}
p = re.compile(r'#([0-9][0-9,]*)+[\n\s]+in[\n\s]+([A-Za-z&\s]+)')

with requests.Session() as s:
    for link in links:
        r = s.get(link, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        fields = ['Product Dimensions', 'Shipping Weight', 'Item model number', 'Best Sellers Rank']
        final_dict = {}
        for field in fields:
            # Selector list: match the field in either an li or a tr element
            element = soup.select_one('li:contains("' + field + '"), tr:contains("' + field + '")')
            if element is None:
                if field == 'Best Sellers Rank':
                    item = dict(zip(map_dict[field], ['N/A', 'N/A']))
                    final_dict = {**final_dict, **item}
                else:
                    final_dict[map_dict[field]] = 'N/A'
            else:
                if field == 'Best Sellers Rank':
                    text = element.text
                    i = 1
                    # Each match is a (rank number, category) pair
                    for x, y in p.findall(text):
                        prefix = 'R' + str(i) + '_'
                        final_dict[prefix + 'NO'] = x
                        final_dict[prefix + 'CAT'] = y.strip()
                        i += 1
                else:
                    item = [string for string in element.stripped_strings][1]
                    final_dict[map_dict[field]] = item.replace('(', '').strip()
        print(final_dict)
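The question also asks for all of R1 to R4 to default to 'NA', which the loop above only does for R1. A minimal sketch of one way to factor that out, using only the standard library, might look like the following. The helper names build_selector and parse_ranks, the max_ranks and default parameters, and the sample text are all hypothetical and not part of the original answer. The pattern is a simplified form of the one above (assumed to capture the same two groups), and the selector string keeps the :contains spelling used in this thread; newer Soup Sieve releases prefer :-soup-contains.

```python
import re

# Simplified form of the rank pattern used above (an assumption: same two
# capture groups, without the nested quantifier).
RANK_PATTERN = re.compile(r'#([0-9][0-9,]*)\s+in\s+([A-Za-z&\s]+)')


def build_selector(field, tags=('li', 'tr')):
    """Return a CSS selector list matching any of `tags` that contains `field`."""
    return ', '.join('{}:contains("{}")'.format(tag, field) for tag in tags)


def parse_ranks(text, max_ranks=4, default='NA'):
    """Map rank matches in `text` to R1_NO/R1_CAT ... keys, pre-filled with `default`."""
    result = {}
    for i in range(1, max_ranks + 1):
        result['R{}_NO'.format(i)] = default
        result['R{}_CAT'.format(i)] = default
    if text is None:
        # Field not found in either li or tr: keep every default.
        return result
    matches = RANK_PATTERN.findall(text)[:max_ranks]
    for i, (number, category) in enumerate(matches, start=1):
        result['R{}_NO'.format(i)] = number
        result['R{}_CAT'.format(i)] = category.strip()
    return result


# Hypothetical sample text standing in for element.text
sample = ("Best Sellers Rank #1,234 in Home & Kitchen "
          "(See Top 100 in Home & Kitchen)\n#5 in Bath Rugs")
print(build_selector('Best Sellers Rank'))
print(parse_ranks(sample))
```

In the loop above, this could be used as element = soup.select_one(build_selector(field)) followed by final_dict.update(parse_ranks(element.text if element else None)), so a missing element and a partially filled rank list are handled by the same code path.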