Understanding callbacks in Scrapy


Question

I am new to Python and Scrapy, and I have not used callback functions before. I need them now for the code below. The first request will be executed, and its response will be sent to the callback function passed as the second argument:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item

I am unable to understand the following:

  1. How is the item populated?
  2. Does the request.meta line execute before the response.meta line in parse_page2?
  3. Where does the item returned from parse_page2 go?
  4. Why is the return request statement needed in parse_page1? I thought the extracted items needed to be returned from here.


Recommended answer

Read the docs:

For spiders, the scraping cycle goes through something like this:


  1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

     The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each URL specified in start_urls, with the parse method as the callback function for those Requests.

  2. In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests also carry a callback (possibly the same one); Scrapy downloads them, and their responses are then handled by the specified callback.

  3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml, or whatever mechanism you prefer), and generate items with the parsed data.

  4. Finally, the items returned from the spider are typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
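The cycle quoted above can be sketched without Scrapy at all. The following is only a toy, standard-library stand-in, not Scrapy's real engine: the Request and Response classes, the PAGES dictionary, the parse_detail callback, and the crawl() loop are all hypothetical names made up for illustration, and the "download" is a dictionary lookup. Real Scrapy schedules downloads asynchronously, so the ordering there is not guaranteed the way it is here.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Request:
    """Toy stand-in for a Scrapy Request: a URL plus the callback for its response."""
    url: str
    callback: Callable


@dataclass
class Response:
    """Toy stand-in for a downloaded page."""
    url: str
    body: str


# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://www.example.com/": "index",
    "http://www.example.com/a": "page a",
    "http://www.example.com/b": "page b",
}


def parse(response):
    """Callback for the start URL: yields one item and two follow-up requests."""
    yield {"url": response.url, "text": response.body}
    for path in ("a", "b"):
        yield Request(response.url + path, callback=parse_detail)


def parse_detail(response):
    """Callback for the follow-up pages: yields a single item."""
    yield {"url": response.url, "text": response.body}


def crawl(start_urls):
    """Toy engine loop: 'download' each request, call its callback, sort the results."""
    # Step 1: initial requests from the start URLs, with parse as the callback.
    queue = deque(Request(url, callback=parse) for url in start_urls)
    items = []
    while queue:
        request = queue.popleft()
        response = Response(request.url, PAGES[request.url])  # simulated download
        # Step 2: the callback may yield items, requests, or both.
        for result in request.callback(response):
            if isinstance(result, Request):
                queue.append(result)   # new requests go back to the scheduler
            else:
                items.append(result)   # items would go on to a pipeline
    return items


items = crawl(["http://www.example.com/"])
```

Here the list returned by crawl() plays the role of the item pipeline: every non-Request value a callback produces ends up there, while every Request is fed back into the queue with its own callback.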


Answers:

How is the 'item' populated? Does the request.meta line execute before the response.meta line in parse_page2?

Spiders are managed by the Scrapy engine. It first makes requests from the URLs specified in start_urls and passes them to the downloader. When a download finishes, the callback specified in the request is called with the response. If the callback returns another request, the same thing is repeated. If the callback returns an Item, the item is passed to a pipeline to save the scraped data. So yes: request.meta is set in parse_page1 before that request is downloaded, and the same dict is available as response.meta by the time parse_page2 runs.
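To make the ordering concrete, here is a minimal sketch of the question's two callbacks driven by a toy loop. It is an illustration under assumptions, not how Scrapy is implemented: the Request and Response classes, the run() function, and the events list are hypothetical, a plain dict stands in for MyItem, and no page is actually downloaded.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Request:
    """Toy stand-in for a Scrapy Request, including its meta dict."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class Response:
    """Toy stand-in for a response; meta is the originating request's meta."""
    url: str
    meta: dict


events = []  # records the order in which the two meta lines run


class Spider:
    def parse_page1(self, response):
        item = {"main_url": response.url}  # plain dict instead of MyItem()
        request = Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)
        request.meta["item"] = item
        events.append("request.meta set in parse_page1")
        return request  # handed back to the engine, which downloads it

    def parse_page2(self, response):
        events.append("response.meta read in parse_page2")
        item = response.meta["item"]  # the same dict object stored above
        item["other_url"] = response.url
        return item  # a finished item, which would go to the pipeline


def run(first_request):
    """Toy engine: keep 'downloading' while callbacks return more requests."""
    result = first_request
    while isinstance(result, Request):
        response = Response(result.url, result.meta)  # simulated download
        result = result.callback(response)
    return result


spider = Spider()
item = run(Request("http://www.example.com/", callback=spider.parse_page1))
```

After the run, events lists the parse_page1 entry first and the parse_page2 entry second, and item holds both main_url and other_url. The item is partially filled in the first callback, travels with the request, and is completed in the second.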


Where is the returned item from parse_page2 going?

What is the need for the return request statement in parse_page1? I thought the extracted items needed to be returned from here.

As stated in the docs, each callback (both parse_page1 and parse_page2) can return either a Request or an Item (or an iterable of them). parse_page1 returns a Request rather than the Item because additional information still needs to be scraped from another URL. The second callback, parse_page2, returns the item because all the information has been scraped and it is ready to be passed to a pipeline.
