How to scrape the next pages in python using Beautifulsoup


Problem description


Suppose I am scraping a url

http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha

and it contains a number of pages holding the data I want to scrape. How can I scrape the data from all of the following pages? I am using Python 3.5.1 and BeautifulSoup. Note: I can't use scrapy or lxml, as they give me installation errors.

Solution

Determine the last page by extracting the page argument from the "Go to the last page" element, then loop over every page while maintaining a web-scraping session via requests.Session():
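The last-page extraction step can be sketched offline against a hypothetical pager snippet (the markup below is illustrative, mirroring the structure the selector relies on, not the site's actual HTML):

```python
import re

from bs4 import BeautifulSoup

# hypothetical pager markup resembling what li.pager-last matches on the site
html = """
<ul class="pager">
  <li class="pager-last"><a href="?sort_filter=alpha&amp;page=187">last</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# grab the href of the "go to the last page" link and pull out the page number
href = soup.select_one("li.pager-last").a["href"]
last_page = int(re.search(r"page=(\d+)", href).group(1))
print(last_page)  # 187
```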

import re

import requests
from bs4 import BeautifulSoup


with requests.Session() as session:
    # extract the last page
    response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha")
    soup = BeautifulSoup(response.content, "html.parser")
    last_page = int(re.search(r"page=(\d+)", soup.select_one("li.pager-last").a["href"]).group(1))

    # loop over every page (page numbers are 0-based, so include the last one)
    for page in range(last_page + 1):
        response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha&page=%d" % page)
        soup = BeautifulSoup(response.content, "html.parser")

        # print the title of every search result
        for result in soup.select("li.search-result"):
            title = result.find("div", class_="title").get_text(strip=True)
            print(title)

Prints:

A C S College of Engineering, Bangalore
A1 Global Institute of Engineering and Technology, Prakasam
AAA College of Engineering and Technology, Thiruthangal
...
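As a side note, the query string can also be assembled by requests itself via the params argument instead of %-formatting, which handles URL encoding for you. A minimal offline sketch, using requests.Request to show the prepared URL without making a network call:

```python
import requests

BASE = "http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India"

# let requests build and encode the query string for a given page
req = requests.Request("GET", BASE, params={"sort_filter": "alpha", "page": 3})
prepared = req.prepare()
print(prepared.url)
```

Inside the loop, the equivalent call would be `session.get(BASE, params={"sort_filter": "alpha", "page": page})`.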
