How to loop through each page of a website for web scraping with BeautifulSoup
Question
I am scraping job posting data from a website using BeautifulSoup. I have working code that does what I need, but it only scrapes the first page of job postings. I am having trouble figuring out how to iteratively update the url to scrape each page. I am new to Python and have looked at a few different solutions to similar questions, but have not figured out how to apply them to my particular url. I think I need to iteratively update the url or somehow click the next button and then loop my existing code through each page. I appreciate any solutions.
URL: https://jobs.utcaerospacesystems.com/search-jobs
Recommended answer
First, BeautifulSoup has nothing to do with fetching web pages - you fetch the page yourself, then feed it to bs4 for processing.
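The problem with the page you linked is that its content is rendered by JavaScript - the job listings only show up once the scripts run in a browser (or any other JavaScript VM), so the raw HTML you download yourself most likely won't contain them.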
@Fabricator is on the right track - you'll need to watch the developer console and see which ajax requests the js sends to the server. In this case, also take a look at the query string params, which include a param called CurrentPage - that's probably the one you want to focus on.
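As a rough sketch of that idea (not tested against the actual site): build one URL per page by changing CurrentPage, fetch each one yourself, and hand the HTML to bs4 until a page comes back empty. The endpoint path `AJAX_URL`, the CSS selector in `parse_jobs`, and the "stop on an empty page" rule are all assumptions for illustration; only the CurrentPage param comes from the site. Swap in whatever request URL and other params the developer console actually shows.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace it with the request URL shown in the
# browser's developer console (Network tab) when you click "next".
AJAX_URL = "https://jobs.utcaerospacesystems.com/search-jobs/results"


def page_url(page, base_url=AJAX_URL, extra_params=None):
    """Build the URL for one results page by setting CurrentPage."""
    params = dict(extra_params or {})  # any other params copied from the console
    params["CurrentPage"] = page
    return f"{base_url}?{urlencode(params)}"


def fetch_html(url):
    """Download one page. Assumes the response body is HTML; if the site
    wraps the HTML in JSON, unwrap it here before returning."""
    import requests  # third-party; imported here so the helpers above work without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_jobs(html):
    """Extract job titles from one page of results with BeautifulSoup."""
    from bs4 import BeautifulSoup  # third-party

    soup = BeautifulSoup(html, "html.parser")
    # Hypothetical selector: adjust it to the markup the site really returns.
    return [link.get_text(strip=True) for link in soup.select("a[data-job-id]")]


def scrape_pages(fetch=fetch_html, parse=parse_jobs, max_pages=50):
    """Request pages 1, 2, 3, ... and stop at the first page with no jobs
    (or after max_pages, as a safety limit)."""
    jobs = []
    for page in range(1, max_pages + 1):
        found = parse(fetch(page_url(page)))
        if not found:
            break
        jobs.extend(found)
    return jobs
```

Calling `scrape_pages()` would then return the titles from every page in one list. Your existing single-page parsing code can go into `parse_jobs` more or less unchanged, since the loop only changes which URL gets fetched.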