How to scrape the next pages in python using Beautifulsoup


Problem description


Suppose I am scraping a url

http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha

and it contains a number of pages holding the data I want to scrape. How can I scrape the data from all of the following pages? I am using Python 3.5.1 and BeautifulSoup. Note: I can't use scrapy or lxml, as they give me installation errors.

Solution

Determine the last page by extracting the page argument from the "Go to the last page" element, then loop over every page while maintaining a web-scraping session via requests.Session():
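The last-page extraction step can be sketched offline against a hypothetical pager snippet (the markup below is illustrative, mirroring the structure the selector relies on, not the site's actual HTML):

```python
import re

from bs4 import BeautifulSoup

# hypothetical pager markup resembling what li.pager-last matches on the site
html = """
<ul class="pager">
  <li class="pager-last"><a href="?sort_filter=alpha&amp;page=187">last</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# grab the href of the "go to the last page" link and pull out the page number
href = soup.select_one("li.pager-last").a["href"]
last_page = int(re.search(r"page=(\d+)", href).group(1))
print(last_page)  # 187
```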

import re

import requests
from bs4 import BeautifulSoup


with requests.Session() as session:
    # extract the last page
    response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha")
    soup = BeautifulSoup(response.content, "html.parser")
    last_page = int(re.search(r"page=(\d+)", soup.select_one("li.pager-last").a["href"]).group(1))

    # loop over every page (page numbers are 0-based, so include the last one)
    for page in range(last_page + 1):
        response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha&page=%d" % page)
        soup = BeautifulSoup(response.content, "html.parser")

        # print the title of every search result
        for result in soup.select("li.search-result"):
            title = result.find("div", class_="title").get_text(strip=True)
            print(title)

Prints:

A C S College of Engineering, Bangalore
A1 Global Institute of Engineering and Technology, Prakasam
AAA College of Engineering and Technology, Thiruthangal
...
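As a side note, the query string can also be assembled by requests itself via the params argument instead of %-formatting, which handles URL encoding for you. A minimal offline sketch, using requests.Request to show the prepared URL without making a network call:

```python
import requests

BASE = "http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India"

# let requests build and encode the query string for a given page
req = requests.Request("GET", BASE, params={"sort_filter": "alpha", "page": 3})
prepared = req.prepare()
print(prepared.url)
```

Inside the loop, the equivalent call would be `session.get(BASE, params={"sort_filter": "alpha", "page": page})`.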
