How to scrape with BeautifulSoup waiting a second to save the soup element to let elements load complete in the page


Problem description

I'm trying to scrape data from THIS WEBSITE, which shows up to three kinds of prices on some products (a muted price, a red price, and a black price). I noticed that when a product has all three prices, the red price changes before the page finishes loading.

When I scrape the website I only get two prices. I think that if the code waited until the page had fully loaded, I would get all three.

This is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.exito.com/televisor-led-samsung-55-pulgadas-uhd-4k-smart-tv-serie-7-24449/p'
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")

# Muted price
MutedPrice = soup.find_all("span", {'class': 'exito-vtex-components-2-x-listPriceValue ph2 dib strike custom-list-price fw5 exito-vtex-component-precio-tachado'})[0].text
# The slice drops the leading "$ "; replace() strips the thousands separators
MutedPrice = pd.to_numeric(MutedPrice[2 - len(MutedPrice):].replace('.', ''))

# Red price
RedPrice = soup.find_all("span", {'class': 'exito-vtex-components-2-x-sellingPrice fw1 f3 custom-selling-price dib ph2 exito-vtex-component-precio-rojo'})[0].text
RedPrice = pd.to_numeric(RedPrice[2 - len(RedPrice):].replace('.', ''))

# Black price
BlackPrice = soup.find_all("span", {'class': 'exito-vtex-components-2-x-alliedPrice fw1 f3 custom-selling-price dib ph2 exito-vtex-component-precio-negro'})[0].text
BlackPrice = pd.to_numeric(BlackPrice[2 - len(BlackPrice):].replace('.', ''))

print('Muted Price:', MutedPrice)
print('Red Price:', RedPrice)
print('Black Price:', BlackPrice)


Actual Results: Muted Price: 3199900 Red Price: 1649868 Black Price: 0


Expected Results: Muted Price: 3199900 Red Price: 1550032 Black Price: 1649868

Answer

It might be that those values are rendered dynamically, i.e. populated by JavaScript running in the page.

requests.get() simply returns the markup received from the server, without any further client-side changes, so this isn't really about waiting: no amount of waiting will make requests execute the page's JavaScript.
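To illustrate the point, here is a minimal sketch. The HTML snippet below is an illustrative stand-in, not the site's actual markup: it mimics what the server might send, where a JavaScript-filled span is still empty in the raw response that requests would see.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for server-side markup: the red price is rendered
# by the server, while the black price span is empty until JavaScript runs.
static_html = """
<span class="exito-vtex-component-precio-rojo">$ 1.649.868</span>
<span class="exito-vtex-component-precio-negro"></span>
"""

soup = BeautifulSoup(static_html, "html.parser")
red = soup.find("span", {"class": "exito-vtex-component-precio-rojo"})
black = soup.find("span", {"class": "exito-vtex-component-precio-negro"})

print(repr(red.text))    # the server-rendered value is present
print(repr(black.text))  # empty: requests never runs the page's JavaScript
```

However long you wait before parsing, the empty span stays empty, because the JavaScript that fills it never executes outside a browser.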

You could perhaps use Selenium with the Chrome WebDriver to load the page URL and get the rendered page source. (A Firefox driver works too.)

Go to chrome://settings/help to check your current Chrome version, then download the driver for that version from here. Make sure the driver file is either on your PATH or in the same folder as your Python script.

Try replacing the first three lines of your existing code with this:

from contextlib import closing
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome  # pip install selenium

url = 'https://www.exito.com/televisor-led-samsung-55-pulgadas-uhd-4k-smart-tv-serie-7-24449/p'

# Use Chrome to get the page with its JavaScript-generated content;
# closing() quits the browser automatically when done
with closing(Chrome(executable_path="./chromedriver")) as browser:
    browser.get(url)
    page_source = browser.page_source

soup = BeautifulSoup(page_source, "lxml")

Output:

Muted Price: 3199900
Red Price: 1550032
Black Price: 1649868
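Once the rendered source is in soup, a small guard against missing spans keeps the script from crashing with an IndexError when a product lacks one of the price tiers. The sketch below is a hedged example: the shortened class names and the '$ 3.199.900'-style price format are assumptions taken from the question, not verified against the live site.

```python
from bs4 import BeautifulSoup

def extract_price(soup, class_name):
    """Return the numeric price in the first span with the given class,
    or 0 if the span is absent or empty.

    Assumes prices look like '$ 3.199.900' (dots as thousands separators).
    """
    span = soup.find("span", {"class": class_name})
    if span is None or not span.text.strip():
        return 0
    digits = span.text.replace(".", "").replace("$", "").strip()
    return int(digits)

# Usage with sample markup (class names shortened for illustration)
html = '<span class="precio-rojo">$ 1.550.032</span>'
soup = BeautifulSoup(html, "html.parser")
print(extract_price(soup, "precio-rojo"))   # 1550032
print(extract_price(soup, "precio-negro"))  # 0: span not present
```

Calling this helper for each of the three class names replaces the three find_all(...)[0] lookups, which raise IndexError whenever a span is missing.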


References:

Get page generated with JavaScript in Python

Selenium - chromedriver executable needs to be in PATH
