scrapy: understanding how items and requests work between callbacks
Problem description
I'm struggling with Scrapy and I don't understand exactly how passing items between callbacks works. Maybe somebody could help me.
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
I'm trying to understand the flow of actions there, step by step:
[parse_page1]
1. item = MyItem() <- the object item is created
2. item['main_url'] = response.url <- we are assigning a value to main_url of the object item
3. request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2) <- we are requesting a new page and launching parse_page2 to scrape it
[parse_page2]
4. item = response.meta['item'] <- I don't understand here. Are we creating a new object item, or is this the object item created in [parse_page1]? And what does response.meta['item'] mean? In 3 we passed the request only information like the link and the callback; we didn't add any additional arguments we could refer to ...
5. item['other_url'] = response.url <- we are assigning a value to other_url of the object item
6. return item <- we are returning the item object as the result of the request
[parse_page1]
7. request.meta['item'] = item <- we are assigning the object item to the request? But the request is finished, the callback already returned the item in 6 ????
8. return request <- we are getting the result of the request, so the item from 6, am I right?
I went through all the documentation concerning scrapy and request/response/meta, but I still don't understand what is happening here in points 4 and 7.
Recommended answer
line 4: request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
line 5: request.meta['item'] = item
line 6: return request
You are confused about the previous code; let me explain it (I numbered the lines so I can refer to them here):
In line 4 you are instantiating a scrapy.Request object. This doesn't work like other request libraries: here you are not calling the url, and not going to the callback function just yet.
In line 5 you are adding arguments to the scrapy.Request object, so, for example, you could also have declared the scrapy.Request object like:

request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2, meta={'item': item})

and you could have avoided line 5.
It is in line 6, when you return the scrapy.Request object, that scrapy makes it work: calling the url specified, going to the following callback, and passing meta along with it. You could also have avoided line 6 (and line 5) if you had returned the request like this:

return scrapy.Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2, meta={'item': item})
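The key point above is that constructing or returning a Request does not fetch anything by itself: the engine schedules it, and later calls the callback with a Response whose meta is the very dict you attached. A rough pure-Python sketch of that hand-off (the Request/Response classes and fake_engine below are toy stand-ins, not Scrapy's real ones):

```python
# Toy stand-ins for scrapy.Request / scrapy.Response to illustrate
# how meta travels from a request to the response seen by the callback.

class Request:
    def __init__(self, url, callback=None, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta if meta is not None else {}

class Response:
    def __init__(self, url, request):
        self.url = url
        self.request = request

    @property
    def meta(self):
        # In Scrapy too, response.meta is a shortcut for response.request.meta
        return self.request.meta

def fake_engine(request):
    """Pretend to download the page, then invoke the callback."""
    response = Response(request.url, request)
    return request.callback(response)

item = {'main_url': 'http://www.example.com/page1.html'}
request = Request('http://www.example.com/some_page.html',
                  callback=lambda resp: resp.meta['item'],
                  meta={'item': item})

result = fake_engine(request)
print(result is item)  # True: the same dict object, not a copy
```

This is why point 4 in the question is not creating a new item: the callback simply reads back the object that was stored on the request in point 7.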
So the idea here is that your callback methods should return (preferably yield) a Request or an Item; scrapy will output the Item and continue crawling from the Request.
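To make that last point concrete, here is a small sketch of a generator-based callback and a loop that consumes it roughly the way Scrapy's engine does: Items are collected as output, Requests are put back on the schedule. (Again, the Request class, the URLs, and the run loop are hypothetical toys, not Scrapy's API.)

```python
# Sketch: a callback may yield both items and requests; the engine
# outputs the items and keeps crawling the requests.

class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

def parse_page(response_url):
    # yield an item for the current page...
    yield {'url': response_url}
    # ...and a follow-up request, unless we are already on page2
    if 'page2' not in response_url:
        yield Request('http://www.example.com/page2.html', parse_page)

def run(start_request):
    """Toy engine loop: drain callbacks, separating items from requests."""
    items, pending = [], [start_request]
    while pending:
        req = pending.pop()
        for result in req.callback(req.url):
            if isinstance(result, Request):
                pending.append(result)   # continue crawling
            else:
                items.append(result)     # output the item
    return items

items = run(Request('http://www.example.com/page1.html', parse_page))
print([i['url'] for i in items])
# -> ['http://www.example.com/page1.html', 'http://www.example.com/page2.html']
```

Yielding instead of returning lets one callback emit several items and several follow-up requests from a single page.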