Need help scraping page with button "load more" calling AJAX to load more items


Problem description

I am using the following code:

from logging import exception
from selenium import webdriver 
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
import datetime
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.action_chains import ActionChains
import random
from Scrapingtools import joinfiles
from Scrapingtools import uploadfiles


options = webdriver.ChromeOptions()
# options.add_argument("--headless")
options.add_argument("--disable-extensions")
options.add_argument("--disable-dev-shm-usage")
#options.add_argument("--no-sandbox")
# options.add_argument("start-maximized")
# options.add_argument("window-size=1900,1080")

driver = webdriver.Chrome(executable_path=r"/usr/bin/chromedriver", options=options)

#url = 'https://www.pccomponentes.com/procesadores'

url_list = [
     'https://www.pccomponentes.com/procesadores',
    #  'https://www.pccomponentes.com/discos-duros/500-gb/conexiones-m-2/disco-ssd/internos',
    #  'https://www.pccomponentes.com/discos-duros/1-tb/conexiones-m-2/disco-ssd/internos',
    #  'https://www.pccomponentes.com/placas-base/amd-b550/atx',
    #  'https://www.pccomponentes.com/placas-base/amd-x570/atx',
    #  'https://www.pccomponentes.com/tarjetas-graficas',
    #  'https://www.pccomponentes.com/memorias-ram/16-gb/kit-2x8gb',
    #  'https://www.pccomponentes.com/ventiladores-cpu/socket-amd-am4',
    #  'https://www.pccomponentes.com/ventiladores-suplementarios/120x120',
    #  'https://www.pccomponentes.com/fuentes-alimentacion/750w/fuente-modular',
    #  'https://www.pccomponentes.com/cajas-pc/antec/atx/be-quiet/cooler-master/corsair/deepcool/fractal/lian-li/msi/phanteks/silverstone/tacens/tempest'
     ]

df_list =[] 
store = 'PCComponentes'
extraction_date = datetime.datetime.today().replace(microsecond=0)

# iterate over every category URL and scrape its product cards
for url in url_list:
    driver.get(url)

    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    # print(soup)
    items = soup.find_all('div', class_='col-xs-6 col-sm-4 col-md-4 col-lg-4')
    # print(len(items))
    print('Found', len(items), 'items in', url)

    for item in items:

            product_name = item.find('h3',class_ = 'c-product-card__title').text.strip()
            try:
                price = item.find('div', class_ = 'c-product-card__prices-actual cy-product-price-normal').text[:-1]
            except AttributeError:
                price = item.find('div', class_ = 'c-product-card__prices-actual c-product-card__prices-actual--discount cy-product-price-discount').text[:-1]
            try:
                old_price = item.find('div',class_ = 'c-product-card__prices-pvp cy-product-price-normal').text[:-1]
            except AttributeError:
                old_price = "No discount"
            # try:
            #     availability = item.find('div', class_ = 'c-product-card__availability disponibilidad-inmediata cy-product-availability-date').text.strip()
            # except AttributeError:
            #     availability = item.find('div', class_ = 'c-product-card__availability disponibilidad-moderada cy-product-availability-date').text.strip()  
            # except AttributeError:
            #     availability = "No Date"  
            try:
                rating = item.find('span',class_ = 'c-star-rating__text cy-product-text').text.strip()
            except AttributeError:
                rating = ""
            try:
                reviews = item.find('span',class_ = 'c-star-rating__text cy-product-rating-result').text.strip()
            except AttributeError:
                reviews = ""
            try:
                brand = item.find('article')['data-brand'] 
            except AttributeError:
                brand = "No brand"
            try:
                category = item.find('article')['data-category']
            except AttributeError:
                category = "No category"
                   
            #  print(product_name, price, old_price, rating, reviews, brand, category, store, extraction_date)

            product_info =  {
                'product_name' : product_name,
                'price' : price,
                'old_price' : old_price,
              # 'availability' : availability,
                'rating' : rating,
                'reviews' : reviews,
                'brand' : brand,
                'category' : category,
                'store' : store,
                'date_extraction' : extraction_date,
            }
            df_list.append(product_info)
            
    sleep(random.uniform(3.5, 7.5))  # pause before moving on to the next URL

df = pd.DataFrame(df_list)
print(df)

It is very similar to another scraper that works fine. The problem with this page is that the button at the bottom makes an AJAX call, and I don't know how to handle it.

The button seems to work, but no new items appear on the page, and the script only retrieves the first 24 items when there should be over a hundred. In this case, the browser opens.

In fact, it seems that the button gets stuck in a loop, alternating between the "ver mas" and "cargando" texts.

I thought it could be a problem with the page load, but testing with different wait times doesn't work.

Can someone help me?

Answer

Probably the simplest way is:

# From inside the page, click the "load more" button once a second.
driver.execute_script('''
  setInterval(() => document.querySelector('#btnMore').click(), 1000)
''')
sleep(100)  # give the AJAX requests time to load the remaining items

You don't have to worry about catching anything; it will quietly fail if the button isn't there.
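For comparison, the same thing can be done from the Selenium side with an explicit click loop that stops once the button disappears. This is only a minimal sketch, assuming the button keeps the #btnMore selector used above; the helper name click_until_gone is made up for illustration:

from time import sleep
from selenium.webdriver.common.by import By
from selenium.common.exceptions import (
    NoSuchElementException,
    ElementClickInterceptedException,
    StaleElementReferenceException,
)

def click_until_gone(driver, selector='#btnMore', pause=1.0, max_rounds=100):
    # Click the "load more" button until it is removed from the DOM.
    for _ in range(max_rounds):
        try:
            driver.find_element(By.CSS_SELECTOR, selector).click()
        except NoSuchElementException:
            break  # button is gone: all items should be loaded
        except (ElementClickInterceptedException, StaleElementReferenceException):
            pass   # button is mid-reload ("cargando"); retry on the next round
        sleep(pause)

# call it right after driver.get(url), before reading driver.page_source:
# click_until_gone(driver)

Either way, once all items have been loaded into the page, the existing BeautifulSoup loop will see more than the first 24 product cards.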

