How to loop through each page of a website for web scraping with BeautifulSoup

Problem description

I am scraping job posting data from a website using BeautifulSoup. I have working code that does what I need, but it only scrapes the first page of job postings. I am having trouble figuring out how to iteratively update the URL to scrape each page. I am new to Python and have looked at several solutions to similar questions, but have not figured out how to apply them to my particular URL. I think I need to either update the URL iteratively or somehow click the "next" button, and then run my existing code on each page. I would appreciate any solutions.

url: https://jobs.utcaerospacesystems.com/search-jobs

Recommended answer

First, BeautifulSoup has nothing to do with fetching web pages: you GET the page yourself, then feed the HTML to bs4 for processing.
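The split between fetching and parsing can be sketched roughly as follows. This is only an illustration, not the asker's code: the function names fetch_html and extract_link_texts are made up here, the standard-library urllib.request stands in for whatever HTTP client you prefer, and collecting the text of every link is an assumed stand-in for the real job-posting selectors.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download the page yourself; BeautifulSoup plays no part in this step."""
    # Some servers reject requests without a User-Agent, so one is sent here.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_link_texts(html):
    """Parse HTML that has already been fetched and return the text of each link."""
    soup = BeautifulSoup(html, "html.parser")
    # Hypothetical selector: replace with whatever identifies a job posting.
    return [a.get_text(strip=True) for a in soup.find_all("a")]
```

Because extract_link_texts only takes a string, it can be run on HTML from any source, for example extract_link_texts(fetch_html("https://jobs.utcaerospacesystems.com/search-jobs")).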

The problem with the page you linked is that its content is generated by JavaScript, so it only renders correctly in a browser (or some other JavaScript engine).

@Fabricator is on the right track: you'll need to watch the developer console and see what AJAX requests the JavaScript sends to the server. Also take a look at the query string parameters of those requests, which include one called CurrentPage. That is probably the one to focus on.
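A minimal sketch of looping over CurrentPage might look like this. Only the parameter name CurrentPage comes from the answer; everything else is an assumption made for illustration: the function names, the idea that pages are numbered from 1, the "stop when a page returns no items" rule, and the max_pages safety cap. The real request URL and any other required parameters have to be copied from the developer console.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_page_url(base_url, page, extra_params=None):
    """Return base_url with a query string whose CurrentPage is `page`."""
    params = dict(extra_params or {})
    params["CurrentPage"] = page  # parameter name taken from the answer
    return f"{base_url}?{urlencode(params)}"


def fetch(url):
    """Download a URL and return the response body as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all_pages(base_url, parse_page, fetch_page=fetch, max_pages=100):
    """Fetch pages 1, 2, 3, ... until a page yields no items.

    parse_page takes the response text and returns a list of items;
    it is where the existing BeautifulSoup code would go.
    """
    results = []
    for page in range(1, max_pages + 1):
        items = parse_page(fetch_page(build_page_url(base_url, page)))
        if not items:  # assumed stop condition: an empty page means we are done
            break
        results.extend(items)
    return results
```

A call might look like scrape_all_pages(ajax_url, parse_jobs), where ajax_url is the request URL seen in the developer console and parse_jobs wraps the existing parsing code; both names are placeholders. Passing the fetch function as an argument also makes it easy to try the loop against saved responses before hitting the server.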
