How to condense script when scraping different locations for one element


Question


I have 2 working scripts that do their job. I want to combine them for efficiency and reduce redundancy. I am using Python 3.7, Beautifulsoup 4.7.1, re, and requests.


script 1 searches 'li' and works with these test URLs
https://www.amazon.com/dp/B00FSCBQV2
https://www.amazon.com/dp/B07L4YHBQ4
https://www.amazon.com/dp/B01N1ZD912
https://www.amazon.com/dp/B0040ODFK4


script 2 searches 'tr' and works with these test URLs
https://www.amazon.com/dp/B00Q2XLI0U
https://www.amazon.com/dp/B00CYVCWXG


I tried using (shorthand) Try: script1 Else: Try: script2 else: pass


But it gets hairy and fails. I would like it in the try/except/pass format.
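The nesting described above can be sketched as follows. `script1` and `script2` are hypothetical stand-ins for the two blocks below; note that `select_one` returns `None` rather than raising an exception, which is part of why the try-based version gets hairy:

```python
# Hypothetical stand-ins for the two scrapers below: the real ones
# search 'li' and 'tr' elements respectively.
def script1():
    raise AttributeError("no matching 'li' element")

def script2():
    return {'R1_NO': '2,045', 'R1_CAT': 'Kitchen & Dining'}

try:
    result = script1()
except Exception:
    try:
        result = script2()
    except Exception:
        result = {}  # neither location matched; values stay 'NA'
```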

#Script 1 
map_dict = {'Amazon Best Sellers Rank': ['R1_NO','R1_CAT']}
p = re.compile(r'#([0-9][0-9,]*)+[\n\s]+in[\n\s]+([A-Za-z&\s]+)')
fields = ['Amazon Best Sellers Rank']
final_dict = {}
#to handle null when writing to oracle later
final_dict['R1_NO'] = 'NA'
final_dict['R1_CAT'] = 'NA'
final_dict['R2_NO'] = 'NA'
final_dict['R2_CAT'] = 'NA'
final_dict['R3_NO'] = 'NA'
final_dict['R3_CAT'] = 'NA'
final_dict['R4_NO'] = 'NA'
final_dict['R4_CAT'] = 'NA'

for field in fields:
    element = soup.select_one('li:contains("' + field + '")')
    if element is None:
        item = dict(zip(map_dict[field], ['NA','NA']))
        final_dict = {**final_dict, **item}
    else:
        text = element.text
        i = 1
        for x,y in p.findall(text):
            prefix = 'R' + str(i) + '_'
            final_dict[prefix + 'NO'] = x
            final_dict[prefix + 'CAT'] = y.strip()
            i+=1

#Script 2 
map_dict = {'Best Sellers Rank': ['R1_NO','R1_CAT']}
p = re.compile(r'#([0-9][0-9,]*)+[\n\s]+in[\n\s]+([A-Za-z&\s]+)')
fields = ['Best Sellers Rank']
final_dict = {}
#to handle null when writing to oracle later
final_dict['R1_NO'] = 'NA'
final_dict['R1_CAT'] = 'NA'
final_dict['R2_NO'] = 'NA'
final_dict['R2_CAT'] = 'NA'
final_dict['R3_NO'] = 'NA'
final_dict['R3_CAT'] = 'NA'
final_dict['R4_NO'] = 'NA'
final_dict['R4_CAT'] = 'NA'

for field in fields:
    element = soup.select_one('tr:contains("' + field + '")')
    if element is None:
        item = dict(zip(map_dict[field], ['NA','NA']))
        final_dict = {**final_dict, **item}
    else:
        text = element.text
        i = 1
        for x,y in p.findall(text):
            prefix = 'R' + str(i) + '_'
            final_dict[prefix + 'NO'] = x
            final_dict[prefix + 'CAT'] = y.strip()
            i+=1


I expect to have a combined DRY script that works on all the provided URLs. The script would look in 'li'; if it's not there, it would look in 'tr', and if it's still not there, the values would be assigned 'NA'. Again, each script works on its own.

Answer


You could combine the two into one, as most of the code is the same. Simply make the field names the same across both: :contains will still match on the shortened field name Best Sellers Rank. Then use the CSS OR syntax (a comma-separated selector list) to handle tr versus li.

import requests
from bs4 import BeautifulSoup as bs
import re

links = ['https://www.amazon.com/dp/B00FSCBQV2','https://www.amazon.com/dp/B00Q2XLI0U']
map_dict = {'Product Dimensions': 'dimensions', 'Shipping Weight': 'weight', 'Item model number': 'Item_No', 'Best Sellers Rank': ['R1_NO','R1_CAT']}

p = re.compile(r'#([0-9][0-9,]*)+[\n\s]+in[\n\s]+([A-Za-z&\s]+)')

with requests.Session() as s:
    for link in links:
        r = s.get(link, headers = {'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        fields = ['Product Dimensions', 'Shipping Weight', 'Item model number', 'Best Sellers Rank']
        final_dict = {}

        for field in fields:
            element = soup.select_one('li:contains("' + field + '"), tr:contains("' + field + '")')
            if element is None:
                if field == 'Best Sellers Rank':
                    item = dict(zip(map_dict[field], ['N/A','N/A']))
                    final_dict = {**final_dict, **item}
                else:
                    final_dict[map_dict[field]] = 'N/A'
            else:
                if field == 'Best Sellers Rank':      
                    text = element.text
                    i = 1
                    for x,y in p.findall(text):
                        prefix = 'R' + str(i) + '_'
                        final_dict[prefix + 'NO'] = x  
                        final_dict[prefix + 'CAT'] = y.strip()
                        i+=1
                else:
                    item = [string for string in element.stripped_strings][1]
                    final_dict[map_dict[field]] = item.replace('(', '').strip()
        print(final_dict)
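The ranking regex can be sanity-checked in isolation. The sample text below is made up to resemble Amazon's "Best Sellers Rank" string:

```python
import re

p = re.compile(r'#([0-9][0-9,]*)+[\n\s]+in[\n\s]+([A-Za-z&\s]+)')

# made-up sample rank text with two category rankings
text = '#2,045 in Kitchen & Dining #12 in Mixing Bowls'
pairs = [(no, cat.strip()) for no, cat in p.findall(text)]
# pairs -> [('2,045', 'Kitchen & Dining'), ('12', 'Mixing Bowls')]
```

The second capture group greedily consumes letters, ampersands, and whitespace, so it stops at the `#` that starts the next ranking; `y.strip()` in the loop above then trims the trailing space from each category name.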
