Need help scraping page with button "load more" calling AJAX to load more items


Problem description

I am using the following code:

from logging import exception
from selenium import webdriver 
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
import datetime
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.action_chains import ActionChains
import random
from Scrapingtools import joinfiles
from Scrapingtools import uploadfiles


options = webdriver.ChromeOptions()
# options.add_argument("--headless")
options.add_argument("--disable-extensions")
options.add_argument("--disable-dev-shm-usage")
#options.add_argument("--no-sandbox")
# options.add_argument("start-maximized")
# options.add_argument("window-size=1900,1080")

driver = webdriver.Chrome(executable_path=r"/usr/bin/chromedriver", options=options)

#url = 'https://www.pccomponentes.com/procesadores'

url_list = [
     'https://www.pccomponentes.com/procesadores',
    #  'https://www.pccomponentes.com/discos-duros/500-gb/conexiones-m-2/disco-ssd/internos',
    #  'https://www.pccomponentes.com/discos-duros/1-tb/conexiones-m-2/disco-ssd/internos',
    #  'https://www.pccomponentes.com/placas-base/amd-b550/atx',
    #  'https://www.pccomponentes.com/placas-base/amd-x570/atx',
    #  'https://www.pccomponentes.com/tarjetas-graficas',
    #  'https://www.pccomponentes.com/memorias-ram/16-gb/kit-2x8gb',
    #  'https://www.pccomponentes.com/ventiladores-cpu/socket-amd-am4',
    #  'https://www.pccomponentes.com/ventiladores-suplementarios/120x120',
    #  'https://www.pccomponentes.com/fuentes-alimentacion/750w/fuente-modular',
    #  'https://www.pccomponentes.com/cajas-pc/antec/atx/be-quiet/cooler-master/corsair/deepcool/fractal/lian-li/msi/phanteks/silverstone/tacens/tempest'
     ]

df_list =[] 
store = 'PCComponentes'
extraction_date = datetime.datetime.today().replace(microsecond=0)

# iterate over every category URL and scrape its product cards
for url in url_list:
    driver.get(url)

    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    # print(soup)
    items = soup.find_all('div', class_='col-xs-6 col-sm-4 col-md-4 col-lg-4')
    # print(len(items))
    print('Found', len(items), 'items in', url)

    for item in items:

            product_name = item.find('h3',class_ = 'c-product-card__title').text.strip()
            try:
                price = item.find('div', class_ = 'c-product-card__prices-actual cy-product-price-normal').text[:-1]
            except AttributeError:
                price = item.find('div', class_ = 'c-product-card__prices-actual c-product-card__prices-actual--discount cy-product-price-discount').text[:-1]
            try:
                old_price = item.find('div',class_ = 'c-product-card__prices-pvp cy-product-price-normal').text[:-1]
            except AttributeError:
                old_price = "No discount"
            # try:
            #     availability = item.find('div', class_ = 'c-product-card__availability disponibilidad-inmediata cy-product-availability-date').text.strip()
            # except AttributeError:
            #     availability = item.find('div', class_ = 'c-product-card__availability disponibilidad-moderada cy-product-availability-date').text.strip()  
            # except AttributeError:
            #     availability = "No Date"  
            try:
                rating = item.find('span',class_ = 'c-star-rating__text cy-product-text').text.strip()
            except AttributeError:
                rating = ""
            try:
                reviews = item.find('span',class_ = 'c-star-rating__text cy-product-rating-result').text.strip()
            except AttributeError:
                reviews = ""
            try:
                brand = item.find('article')['data-brand'] 
            except AttributeError:
                brand = "No brand"
            try:
                category = item.find('article')['data-category']
            except AttributeError:
                category = "No category"
                   
            #  print(product_name, price, old_price, rating, reviews, brand, category, store, extraction_date)

            product_info =  {
                'product_name' : product_name,
                'price' : price,
                'old_price' : old_price,
              # 'availability' : availability,
                'rating' : rating,
                'reviews' : reviews,
                'brand' : brand,
                'category' : category,
                'store' : store,
                'date_extraction' : extraction_date,
            }
            df_list.append(product_info)
            
    sleep(random.uniform(3.5, 7.5))  # pause before moving on to the next URL

df = pd.DataFrame(df_list)
print(df)

It is very similar to another scraper that works fine. The problem with this page is that the button at the bottom makes an AJAX call, and I don't know how to handle it.

The button seems to work, but no new items appear on the page, and the script only retrieves the first 24 items when there should be over a hundred. In this case, the browser opens.

In fact, it seems that the button gets stuck in a loop, alternating between the "ver mas" and "cargando" texts.

I thought it could be a problem with the page load, but testing with different wait times doesn't work.

Can someone help me?

Answer

Probably the simplest way is:

# From inside the page, click the "load more" button once a second.
driver.execute_script('''
  setInterval(() => document.querySelector('#btnMore').click(), 1000)
''')
sleep(100)  # give the AJAX requests time to load the remaining items

You don't have to worry about catching anything; it will quietly fail if the button isn't there.
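For comparison, the same thing can be done from the Selenium side with an explicit click loop that stops once the button disappears. This is only a minimal sketch, assuming the button keeps the #btnMore selector used above; the helper name click_until_gone is made up for illustration:

from time import sleep
from selenium.webdriver.common.by import By
from selenium.common.exceptions import (
    NoSuchElementException,
    ElementClickInterceptedException,
    StaleElementReferenceException,
)

def click_until_gone(driver, selector='#btnMore', pause=1.0, max_rounds=100):
    # Click the "load more" button until it is removed from the DOM.
    for _ in range(max_rounds):
        try:
            driver.find_element(By.CSS_SELECTOR, selector).click()
        except NoSuchElementException:
            break  # button is gone: all items should be loaded
        except (ElementClickInterceptedException, StaleElementReferenceException):
            pass   # button is mid-reload ("cargando"); retry on the next round
        sleep(pause)

# call it right after driver.get(url), before reading driver.page_source:
# click_until_gone(driver)

Either way, once all items have been loaded into the page, the existing BeautifulSoup loop will see more than the first 24 product cards.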

