How to use BeautifulSoup to find all the next links


Problem description

I'm currently scraping all the pages of a specific website by presetting a variable called number_of_pages. Presetting this variable works until a new page is added that I don't know about. For example, the code below handles 3 pages, but the website now has 4.

base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
number_of_pages = 3
for i in range(1, number_of_pages + 1):  # range needs number_of_pages + 1 to cover all pages
    url_to_scrape = base_url + str(i)

I would like to use BeautifulSoup to find all the "next" links on the website to scrape. The code below finds the second URL, but not the third or fourth. How do I build a list of all the pages prior to scraping them?

from bs4 import BeautifulSoup
import requests

base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
CrawlRequest = requests.get(base_url)
raw_html = CrawlRequest.text
linkSoupParser = BeautifulSoup(raw_html, 'html.parser')
page = linkSoupParser.find('div', {'class': 'pagination'})
for list_of_links in page.find('a', href=True, text='next'):
    nextURL = 'https://securityadvisories.paloaltonetworks.com' + list_of_links.parent['href']
    print(nextURL)

Answer

There are several different ways to approach the pagination. Here is one of them.

The idea is to start an endless loop and break out of it once there is no "next" link:

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests


with requests.Session() as session:
    page_number = 1
    url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
    while True:
        print("Processing page: #{page_number}; url: {url}".format(page_number=page_number, url=url))
        response = session.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')

        # check if there is a next page, break if not
        next_link = soup.find("a", text="next")
        if next_link is None:
            break

        # resolve the relative "next" href against the current url
        url = urljoin(url, next_link["href"])
        page_number += 1

print("Done.")

If you execute it, you will see the following messages printed:

Processing page: #1; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=
Processing page: #2; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=2
Processing page: #3; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=3
Processing page: #4; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=4
Done.

Note that, to improve performance and persist cookies across requests, we maintain a web-scraping session with requests.Session.
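
If you specifically want to build the complete list of page URLs before scraping any of them, as the question asks, the same "next"-link loop can collect the URLs first. The sketch below reuses the answer's approach; collect_page_urls is a hypothetical helper name, not part of the original answer:

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests


def collect_page_urls(start_url):
    # Hypothetical helper: follow the "next" links and return every page URL.
    urls = [start_url]
    with requests.Session() as session:
        url = start_url
        while True:
            soup = BeautifulSoup(session.get(url).content, 'html.parser')
            next_link = soup.find("a", text="next")
            if next_link is None:
                break
            url = urljoin(url, next_link["href"])
            urls.append(url)
    return urls


page_urls = collect_page_urls('https://securityadvisories.paloaltonetworks.com/Home/Index/?page=')
print(page_urls)  # scrape each collected URL afterwards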
