Scraping with selenium and BeautifulSoup doesn't return all the items in the page


Problem description

So I'm coming from here.


Now I am able to interact with the page, scroll down the page, close the popup that appears and click at the bottom to expand the page.


The problem is when I count the items, the code only returns 20 and it should be 40.


I have checked the code again and again - I'm missing something but I don't know what.

Please see the code below:

from selenium import webdriver 
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
#options.add_argument('--headless')
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe", options=options)

url = 'https://www.coolmod.com/componentes-pc-procesadores?f=375::No'

driver.get(url)  

iter=1
while True:
        scrollHeight = driver.execute_script("return document.documentElement.scrollHeight")
        Height=10*iter
        driver.execute_script("window.scrollTo(0, " + str(Height) + ");")
        
        if Height > scrollHeight:
            print('End of page')
            break
        iter+=1

time.sleep(3)

popup = driver.find_element_by_class_name('confirm').click()

time.sleep(3)

ver_mas = driver.find_elements_by_class_name('button-load-more')

for x in range(len(ver_mas)):

  if ver_mas[x].is_displayed():
      driver.execute_script("arguments[0].click();", ver_mas[x])
      time.sleep(10)

page_source = driver.page_source

soup = BeautifulSoup(page_source, 'lxml')
# print(soup)

items = soup.find_all('div',class_='col-xs-12 col-sm-6 col-sm-6 col-md-6 col-lg-3 col-product col-custom-width')
print(len(items))

What is wrong? I'm a newbie in the scraping world.

Regards

Recommended answer


Your while and for statements don't work as intended.

  1. Using while True: is a bad practice
  2. You scroll until the bottom - but the button-load-more button isn't displayed there - and Selenium will not find it as displayed
  3. find_elements_by_class_name - looks for multiple elements - the page has only one element with that class
  4. if ver_mas[x].is_displayed(): at best this executes only once, because the range has length 1
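The control flow of the fix below (keep clicking until the element can no longer be found, then break) can be sketched without a browser. FakeDriver and times_visible here are made-up stand-ins for illustration, not Selenium API:

```python
class NoSuchElementException(Exception):
    """Stand-in for selenium.common.exceptions.NoSuchElementException."""

class FakeDriver:
    """Minimal stand-in for the real driver: the load-more button can be
    'found' only a fixed number of times before it disappears."""
    def __init__(self, times_visible):
        self.times_visible = times_visible

    def find_element_by_class_name(self, name):
        if self.times_visible <= 0:
            raise NoSuchElementException(name)
        return name  # a dummy element

driver = FakeDriver(times_visible=2)

clicks = 0
while True:
    try:
        driver.find_element_by_class_name('button-load-more')
        # the real code would move to the element and JS-click it here
        driver.times_visible -= 1
        clicks += 1
    except NoSuchElementException:
        break  # button gone: all items have been loaded

print(clicks)  # 2
```

The point is that the exception, not a scroll-height check, is what ends the loop.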


Below you can find the solution - here the code looks for the button, moves to it instead of scrolling, and performs a click. If the code fails to find the button - meaning that all the items were loaded - it breaks the while and moves forward.

from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import NoSuchElementException

url = 'https://www.coolmod.com/componentes-pc-procesadores?f=375::No'

driver.get(url)
time.sleep(3)
popup = driver.find_element_by_class_name('confirm').click()

iter = 1
while iter > 0:  # loop until the load-more button can no longer be found
    time.sleep(3)
    try:
        ver_mas = driver.find_element_by_class_name('button-load-more')
        actions = ActionChains(driver)
        actions.move_to_element(ver_mas).perform()
        driver.execute_script("arguments[0].click();", ver_mas)

    except NoSuchElementException:
        break
    iter += 1

page_source = driver.page_source

soup = BeautifulSoup(page_source, 'lxml')
# print(soup)

items = soup.find_all('div', class_='col-xs-12 col-sm-6 col-sm-6 col-md-6 col-lg-3 col-product col-custom-width')
print(len(items))
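One thing to be aware of when counting items: passing the full multi-class string to find_all only matches an identical class attribute, so a small markup change breaks it. Searching by a single class name (or a CSS selector) is more robust. A minimal, self-contained illustration with made-up markup:

```python
from bs4 import BeautifulSoup

# Tiny inline page: two products whose class attributes differ slightly.
html = """
<div class="col-xs-12 col-product">CPU A</div>
<div class="col-product col-custom-width">CPU B</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Exact multi-class string: matches only an identical class attribute.
exact = soup.find_all('div', class_='col-xs-12 col-product')

# Single class name: matches any element that carries that class.
by_class = soup.find_all('div', class_='col-product')

# CSS selector, equivalent to the single-class search.
by_css = soup.select('div.col-product')

print(len(exact), len(by_class), len(by_css))  # 1 2 2
```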

