Not getting all links from webpage


Question

I am working on a Web scraping project. The URL for the website I am scraping is https://www.beliani.de/sofas/ledersofa/

I am scraping all the product links listed on this page. I tried getting the links using both Requests-HTML and Selenium, but I got only 57 and 24 links respectively, even though there are more than 150 products listed on the page. Below are the code blocks I am using.

Using Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep

options = Options()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36")

# path to the Chrome driver
DRIVER_PATH = 'C:/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH, chrome_options=options)

url = 'https://www.beliani.de/sofas/ledersofa/'

driver.get(url)
sleep(20)

links = []
for a in driver.find_elements_by_xpath('//*[@id="offers_div"]/div/div/a'):
    print(a)
    links.append(a)
print(len(links))

Using Requests-HTML:

from requests_html import HTMLSession

url = 'https://www.beliani.de/sofas/ledersofa/'

s = HTMLSession()
r = s.get(url)

r.html.render(sleep=20)

products = r.html.xpath('//*[@id="offers_div"]', first=True)

# Getting 57 links using the block below:
links = []
for link in products.absolute_links:
    print(link)
    links.append(link)

print(len(links))

I cannot figure out which step I am doing wrong or what is missing.

Answer

You have to scroll through the website and reach the end of the page in order to load all of its content. Simply opening the site loads only what is needed to view the visible section of the page, so when you ran your code it could only retrieve data from the part that had already been loaded.

This one gave me 160 links:

driver.get('https://www.beliani.de/sofas/ledersofa/')
sleep(3)

# get the total height of the document
height = driver.execute_script('return document.body.scrollHeight')

# now break the page into parts so that each section is scrolled through and allowed to load
scroll_height = 0
for i in range(10):
    scroll_height = scroll_height + (height/10)
    driver.execute_script('window.scrollTo(0, arguments[0]);', scroll_height)
    sleep(2)

# once the loop has finished, collect the links; I have used the 'class' locator, but you can use any locator you like
a_tags = driver.find_elements_by_class_name('itemBox')
count = 0
for i in a_tags:
    if i.get_attribute('href') is not None:
        print(i.get_attribute('href'))
        count += 1

print(count)
driver.quit()
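
If the number of products changes, ten fixed scroll steps may not always be enough. A common variation is to keep scrolling to the bottom until document.body.scrollHeight stops growing. Below is a minimal sketch of that pattern, reusing the driver path and the 'itemBox' class name from the code above; the loop structure itself is illustrative and not part of the original answer.

from time import sleep
from selenium import webdriver

# same driver setup as in the question (driver path is an assumption)
driver = webdriver.Chrome(executable_path='C:/chromedriver')
driver.get('https://www.beliani.de/sofas/ledersofa/')
sleep(3)

# scroll until the document height stops growing, i.e. no more items get lazy-loaded
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(2)  # give the newly loaded items time to render
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

# collect the product links, reusing the 'itemBox' class locator from the answer
links = [el.get_attribute('href')
         for el in driver.find_elements_by_class_name('itemBox')
         if el.get_attribute('href') is not None]
print(len(links))
driver.quit()

On the Requests-HTML side, render() also accepts scrolldown and sleep arguments that scroll the page before rendering, which may help with the 57-link result, though I have not verified that against this site.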
