How to use scrapy to crawl multiple pages?
Question
All the Scrapy examples I've found talk about how to crawl a single page, pages with the same URL schema, or all the pages of a website. I need to crawl a series of pages A, B, C, where page A contains the link to B, and so on. For example, the website structure is:
A
----> B
---------> C
D
E
I need to crawl all the C pages, but to get the link to C I need to crawl A and B first. Any hints?
Answer
See the Scrapy Request structure; to crawl a chain like this you'll have to use the callback parameter, like the following:
class MySpider(BaseSpider):
    ...
    # spider starts here
    def parse(self, response):
        ...
        # A, D, E are done in parallel; A -> B -> C are done serially
        yield Request(url=<A url>,
                      ...
                      callback=self.parseA)
        yield Request(url=<D url>,
                      ...
                      callback=self.parseD)
        yield Request(url=<E url>,
                      ...
                      callback=self.parseE)

    def parseA(self, response):
        ...
        yield Request(url=<B url>,
                      ...
                      callback=self.parseB)

    def parseB(self, response):
        ...
        yield Request(url=<C url>,
                      ...
                      callback=self.parseC)

    def parseC(self, response):
        ...

    def parseD(self, response):
        ...

    def parseE(self, response):
        ...