Get data-testid and attributes from html using Beautifulsoup

Problem description

Web-dev newbie here, so please be nice.

I find this tag really weird to parse.

Consider the following code, which fetches the HTML document:

import urllib3
from bs4 import BeautifulSoup

url = 'https://www.carrefourkuwait.com/mafkwt/en/Frozen-Food/c/FKWT6000000?currentPage=1&filter=&nextPageOffset=0&pageSize=60&sortBy=relevance'

req = urllib3.PoolManager()
res = req.request('GET', url)
soup = BeautifulSoup(res.data, 'html.parser')
soup

I am trying to get the product name and price. But using soup.findAll('div', {'data-testid': 'product_name'}) doesn't work.

The issue here is that the product name and price are attributes of a link in the <a> tag. Even with soup.findAll('a') I get nothing: []

Can you please help with this?

I am also unable to page through the results. I wrote this code, but it doesn't work (it keeps giving me duplicates of page 1):

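import json
import re

import pandas as pd
import requests
from tqdm import tqdm

# Base URL inferred from the first snippet; the original did not define parent_url
parent_url = 'https://www.carrefourkuwait.com/mafkwt/en/'
tag = 'Bakery/c/FKWT1610000'

scrap_all = pd.DataFrame()

for x in tqdm(range(1,10)):
    scrap_page = pd.DataFrame()
    r = requests.get(parent_url+tag+'?currentPage='+str(x)+'&filter=&nextPageOffset=0&pageSize=200&sortBy=relevance',
                     headers = {'User-Agent':'Mozilla/5.0'})

    # Pull the embedded JSON out of the page source and parse it
    data = json.loads(re.search(r'(\{"prop.*\})', r.text).group(1))
    data = data['props']['initialState']['search']['products']
    scrap_page['item_desc'] = [i['name'] for i in data]
    scrap_page['item_price'] = [i['originalPrice'] for i in data]

    # Accumulate into scrap_all (the original concatenated into an undefined scrap_carrefour)
    scrap_all = pd.concat([scrap_all,scrap_page])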

Solution

The data is pulled dynamically from a script tag. Because JavaScript doesn't run when you fetch the page with requests, this info stays inside the script tag and is not present in the elements you are searching.

You can regex out the string holding the relevant info, parse it with json, and create a dict as follows:

import requests, re, json

r = requests.get('https://www.carrefourkuwait.com/mafkwt/en/Frozen-Food/c/FKWT6000000?currentPage=1&filter=&nextPageOffset=0&pageSize=60&sortBy=relevance',
                 headers = {'User-Agent':'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"prop.*\})', r.text).group(1))
info = {i['name']:str(i['originalPrice'])+ ' '+ i['currency'] for i in data['props']['initialState']['search']['products']}
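
To try the same regex-then-json approach without hitting the site, here is a minimal offline sketch. The sample_html string, the product entries in it, and the extract_products helper are all hypothetical stand-ins made up for illustration; only the regex and the props/initialState/search/products path come from the answer above. The helper also returns an empty list instead of raising if the pattern or keys are missing, which is an added design choice rather than part of the original answer.

```python
import json
import re

# Hypothetical stand-in for r.text: the product data sits in a script tag as
# JSON, and no rendered product elements exist in the static HTML.
payload = json.dumps({
    "props": {"initialState": {"search": {"products": [
        {"name": "Frozen Peas 450g", "originalPrice": 0.55, "currency": "KWD"},
        {"name": "Vanilla Ice Cream 1L", "originalPrice": 1.25, "currency": "KWD"},
    ]}}}
})
sample_html = (
    "<html><body><div id='root'></div>"
    f"<script type='application/json'>{payload}</script>"
    "</body></html>"
)


def extract_products(html):
    """Return the product list embedded in the page, or [] if it is absent."""
    # Same pattern as the answer: grab from '{"prop' to the last '}' on the line.
    match = re.search(r'(\{"prop.*\})', html)
    if match is None:
        return []  # pattern not found, e.g. the page layout differs
    try:
        data = json.loads(match.group(1))
        return data["props"]["initialState"]["search"]["products"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []  # matched text was not the expected JSON structure


products = extract_products(sample_html)
info = {p["name"]: f"{p['originalPrice']} {p['currency']}" for p in products}
print(info)
```

Checking for an empty result before building the dict makes it easier to tell a blocked or changed page apart from a genuinely empty category.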
