Recursive use of Scrapy to scrape webpages from a website
Question
I have recently started working with Scrapy. I am trying to gather some info from a large list which is divided into several pages (about 50). I can easily extract what I want from the first page by including it in the start_urls list. However, I don't want to add the links to all 50 pages to this list; I need a more dynamic way. Does anyone know how I can iteratively scrape web pages? Does anyone have any examples of this?
Thanks!
Answer
Use urllib2 to download a page. Then use either re (regular expressions) or BeautifulSoup (an HTML parser) to find the link to the next page you need. Download that page with urllib2. Rinse and repeat.
Scrapy is great, but you don't need it for what you're trying to do.
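To make the rinse-and-repeat loop concrete, here is a minimal sketch of the re-based variant. The page contents, URLs, and the "Next" link markup are made up for illustration; in a real crawl each page would come from urllib2.urlopen(url).read() (urllib.request.urlopen on Python 3) instead of a dict of fake pages.

```python
import re

# Fake pages standing in for downloaded HTML, so the pagination loop
# itself is easy to see. The URLs and markup are hypothetical.
FAKE_PAGES = {
    "/list?page=1": '<a href="/list?page=2">Next</a> item A',
    "/list?page=2": '<a href="/list?page=3">Next</a> item B',
    "/list?page=3": "item C",  # last page: no Next link
}

NEXT_LINK = re.compile(r'<a href="([^"]+)">Next</a>')

def find_next_page(html):
    """Return the URL of the next page, or None if there is none."""
    match = NEXT_LINK.search(html)
    return match.group(1) if match else None

def crawl(start_url):
    """Follow Next links page by page, collecting each page's HTML."""
    url, pages = start_url, []
    while url is not None:
        html = FAKE_PAGES[url]      # real code: urlopen(url).read()
        pages.append(html)
        url = find_next_page(html)  # rinse and repeat
    return pages

pages = crawl("/list?page=1")  # walks all three pages, then stops
```

The same loop works with BeautifulSoup by swapping the regex for something like soup.find("a", string="Next"), which is more robust against markup changes than a regular expression.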