Recursive use of Scrapy to scrape webpages from a website


Question

I have recently started to work with Scrapy. I am trying to gather some info from a large list which is divided into several pages (about 50). I can easily extract what I want from the first page, including the first page in the start_urls list. However, I don't want to add the links to all 50 of these pages to that list. I need a more dynamic way. Does anyone know how I can iteratively scrape web pages? Does anyone have an example of this?

Thanks!

Answer

Use urllib2 to download a page. Then use either re (regular expressions) or BeautifulSoup (an HTML parser) to find the link to the next page you need. Download that with urllib2. Rinse and repeat.
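For illustration, here is a minimal sketch of that rinse-and-repeat loop, assuming a Python 2 environment (urllib2 is Python 2 only). The start URL and the markup of the "next page" link are hypothetical; adapt both to the actual site you are scraping.

```python
# Sketch of the download / parse / follow-the-next-link loop.
import urllib2
from urlparse import urljoin

from bs4 import BeautifulSoup

url = 'http://example.com/list?page=1'  # hypothetical first page
while url:
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    # ... extract whatever you need from `soup` here ...

    # Find the link to the next page. We assume it is an <a> tag
    # with class "next"; the loop stops when no such link exists.
    next_link = soup.find('a', {'class': 'next'})
    url = urljoin(url, next_link['href']) if next_link else None
```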

Scrapy is great, but you don't need it to do what you're trying to do.

