Scraping with selenium and BeautifulSoup doesn't return all the items in the page
Question
So I am coming from here.
Now I am able to interact with the page, scroll down the page, close the popup that appears and click at the bottom to expand the page.
The problem is when I count the items, the code only returns 20 and it should be 40.
I have checked the code again and again - I'm missing something but I don't know what.
Please see the code below:
```python
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
# options.add_argument('--headless')
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe", options=options)

url = 'https://www.coolmod.com/componentes-pc-procesadores?f=375::No'
driver.get(url)

iter = 1
while True:
    scrollHeight = driver.execute_script("return document.documentElement.scrollHeight")
    Height = 10 * iter
    driver.execute_script("window.scrollTo(0, " + str(Height) + ");")
    if Height > scrollHeight:
        print('End of page')
        break
    iter += 1

time.sleep(3)
popup = driver.find_element_by_class_name('confirm').click()
time.sleep(3)

ver_mas = driver.find_elements_by_class_name('button-load-more')
for x in range(len(ver_mas)):
    if ver_mas[x].is_displayed():
        driver.execute_script("arguments[0].click();", ver_mas[x])
        time.sleep(10)

page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
# print(soup)
items = soup.find_all('div', class_='col-xs-12 col-sm-6 col-sm-6 col-md-6 col-lg-3 col-product col-custom-width')
print(len(items))
```
What is wrong? I'm a newbie in the scraping world.
Regards
Answer
Your `while` and `for` statements don't work as intended:
- Using `while True:` is a bad practice.
- You scroll down to the bottom, but the `button-load-more` button isn't displayed there, so Selenium will not find it as displayed.
- `find_elements_by_class_name` looks for multiple elements, but the page has only one element with that class.
- `if ver_mas[x].is_displayed():` will, if you are lucky, be executed only once, because the range is 1.
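The robust alternative is to keep clicking until the button can no longer be found. That loop's logic can be exercised without a browser using a stub driver (all class and method names below are illustrative stand-ins for the Selenium API, not part of it):

```python
class NoSuchElementException(Exception):
    """Stand-in for selenium.common.exceptions.NoSuchElementException."""

class StubDriver:
    """Hypothetical driver whose 'load more' button vanishes after N clicks."""
    def __init__(self, clicks_until_gone=2):
        self.remaining = clicks_until_gone
        self.clicks = 0

    def find_element_by_class_name(self, name):
        if self.remaining <= 0:
            raise NoSuchElementException(name)  # button is gone
        return name  # dummy "element"

    def click(self, element):
        self.clicks += 1
        self.remaining -= 1

def click_until_gone(driver):
    """Click the load-more button until it disappears, then stop."""
    while True:
        try:
            button = driver.find_element_by_class_name('button-load-more')
        except NoSuchElementException:
            break  # all items loaded
        driver.click(button)
    return driver.clicks

driver = StubDriver(clicks_until_gone=2)
print(click_until_gone(driver))  # → 2
```

The loop terminates on the exception rather than on a display check, which is exactly why it cannot stop early while more items remain.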
Below you can find the solution: the code looks for the button, moves to it instead of scrolling, and performs the click. If it fails to find the button, meaning that all the items were loaded, it breaks out of the `while` loop and moves on.
```python
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import NoSuchElementException

url = 'https://www.coolmod.com/componentes-pc-procesadores?f=375::No'
driver.get(url)
time.sleep(3)
popup = driver.find_element_by_class_name('confirm').click()

iter = 1
while iter > 0:
    time.sleep(3)
    try:
        # Find the "load more" button, move to it and click it
        ver_mas = driver.find_element_by_class_name('button-load-more')
        actions = ActionChains(driver)
        actions.move_to_element(ver_mas).perform()
        driver.execute_script("arguments[0].click();", ver_mas)
    except NoSuchElementException:
        # No button left - all items have been loaded
        break
    iter += 1

page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
# print(soup)
items = soup.find_all('div', class_='col-xs-12 col-sm-6 col-sm-6 col-md-6 col-lg-3 col-product col-custom-width')
print(len(items))
```
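One more fragile spot is the long class string passed to `find_all`: BeautifulSoup matches a whitespace-containing `class_` string against the full attribute value exactly, so any reordering or extra class on the page breaks it. Selecting by a single distinguishing class is sturdier. A minimal sketch, using made-up HTML that only mimics the product grid's markup (an assumption, not the live site):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the product grid (not the real page).
html = """
<div class="col-xs-12 col-sm-6 col-md-6 col-lg-3 col-product col-custom-width">A</div>
<div class="col-product col-xs-12">B</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Exact multi-class string: order- and content-sensitive, misses the 2nd div.
strict = soup.find_all('div', class_='col-xs-12 col-sm-6 col-md-6 col-lg-3 col-product col-custom-width')

# Single distinguishing class: matches any div whose class list contains it.
loose = soup.select('div.col-product')

print(len(strict), len(loose))  # 1 2
```

If `col-product` alone uniquely identifies the item cards on the real page, `soup.select('div.col-product')` would keep working even if the grid's responsive classes change.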