Understanding callbacks in Scrapy
Question
I am new to Python and Scrapy and have not used callback functions before, but I need them for the code below. The first request is executed, and its response is sent to the callback function passed as the second argument:
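from scrapy import Request

# Both methods belong to a scrapy.Spider subclass; MyItem is an Item class defined elsewhere.
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    # Attach the partially filled item so the next callback can pick it up.
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    # response.meta carries the meta dict of the request that produced this response.
    item = response.meta['item']
    item['other_url'] = response.url
    return item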
I am unable to understand the following:
- How is the item populated?
- Does the request.meta line execute before the response.meta line in parse_page2?
- Where does the item returned from parse_page2 go?
- What is the need for the return request statement in parse_page1? I thought the extracted items needed to be returned from here.
Recommended answer
Read the docs:
For spiders, the scraping cycle goes through something like this:
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each of the URLs specified in start_urls, with the parse method as the callback function for those Requests.
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same one) and will then be downloaded by Scrapy, and their responses will be handled by the specified callback.
In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
Answers:
How is the 'item' populated? Does the request.meta line execute before the response.meta line in parse_page2?
Spiders are managed by the Scrapy engine. It first makes requests from the URLs specified in start_urls and passes them to the downloader. When a download finishes, the callback specified in the request is called with the response. If the callback returns another request, the same thing is repeated. If the callback returns an Item, the item is passed to a pipeline to save the scraped data.
So yes: parse_page1 runs first and stores the item in request.meta. Scrapy exposes the meta dict of the originating request as response.meta, so when parse_page2 is later called it reads back the same item and finishes populating it.
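The order of calls described above can be sketched without Scrapy at all. The following is only a toy model, not Scrapy's actual engine: the Request, Response, ToySpider and crawl names are hypothetical stand-ins, the start URL is made up, nothing is downloaded, and callbacks are assumed to return a single Request or item rather than an iterable. It is meant only to show that the request.meta line runs before parse_page2 is ever called, and where the returned item ends up.

```python
from collections import deque


class Request:
    """Stand-in for scrapy.Request: a URL, a callback, and a meta dict."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback
        self.meta = {}


class Response:
    """Stand-in for a Scrapy response; it shares the meta of its request."""
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta


class ToySpider:
    # Hypothetical start URL, used only for illustration.
    start_urls = ["http://www.example.com/page1.html"]

    def parse_page1(self, response):
        item = {"main_url": response.url}
        request = Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)
        request.meta["item"] = item   # runs before parse_page2 is ever called
        return request

    def parse_page2(self, response):
        item = response.meta["item"]  # the same dict stored in parse_page1
        item["other_url"] = response.url
        return item


def crawl(spider):
    """Toy engine loop: no networking, just the order of the calls."""
    queue = deque(Request(url, spider.parse_page1) for url in spider.start_urls)
    collected = []                      # stands in for the item pipeline
    while queue:
        request = queue.popleft()
        response = Response(request)    # pretend the download happened
        result = request.callback(response)
        if isinstance(result, Request):
            queue.append(result)        # a returned request gets scheduled
        elif result is not None:
            collected.append(result)    # a returned item goes to the pipeline
    return collected


items = crawl(ToySpider())
print(items)
```

Running it prints a single item containing both main_url and other_url: the request returned by parse_page1 is what causes parse_page2 to be called, and the item returned by parse_page2 is what the "pipeline" receives.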
Where does the item returned from parse_page2 go?
What is the need for the return request statement in parse_page1? I thought the extracted items needed to be returned from here.
As stated in the docs, each callback (both parse_page1 and parse_page2) can return either a Request or an Item (or an iterable of them). parse_page1 returns a Request rather than the Item because additional information still needs to be scraped from another URL. The second callback, parse_page2, returns an item because all the information has been scraped and is ready to be passed to a pipeline.