scrapy: understanding how items and requests work between callbacks


Problem description

I'm struggling with Scrapy and I don't understand how exactly passing items between callbacks works. Maybe somebody could help me.

I am working with the example from http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item

I'm trying to understand flow of actions there, step by step:

[parse_page1]

  1. item = MyItem() <- object item is created
  2. item['main_url'] = response.url <- we are assigning value to main_url of object item
  3. request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2) <- we are requesting a new page and launching parse_page2 to scrape it.

[parse_page2]

  4. item = response.meta['item'] <- I don't understand here. Are we creating a new object item, or is this the object item created in [parse_page1]? And what does response.meta['item'] mean? In 3 we passed only information like the link and the callback to the request; we didn't add any additional arguments we could refer to ...
  5. item['other_url'] = response.url <- we are assigning value to other_url of object item
  6. return item <- we are returning the item object as a result of the request

[parse_page1]

  7. request.meta['item'] = item <- We are assigning object item to the request? But the request is finished, the callback already returned the item in 6 ????
  8. return request <- we are getting the results of the request, so the item from 6, am I right?

I went through all documentation concerning scrapy and request/response/meta but still I don't understand what is happening here in points 4 and 7.

Answer

line 4: request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
line 5: request.meta['item'] = item
line 6: return request

You are confused about the previous code, so let me explain it (I numbered the lines above so I can refer to them here):

  1. In line 4, you are instantiating a scrapy.Request object. This doesn't work like other request libraries; here you are not calling the url, and not going to the callback function just yet.

  2. You are adding arguments to the scrapy.Request object in line 5, so for example you could also declare the scrapy.Request object like:

request = scrapy.Request("http://www.example.com/some_page.html",
        callback=self.parse_page2, meta={'item': item})

and you could have avoided line 5.

  3. It is in line 6, when you return the scrapy.Request object, that scrapy makes it work: calling the specified url, going to the following callback, and passing meta along with it. You could also have avoided line 6 (and line 5) if you had returned the request like this:

return scrapy.Request("http://www.example.com/some_page.html",
        callback=self.parse_page2, meta={'item': item})
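This also answers the confusion about point 4: response.meta['item'] is not a new object, it is the very item created in [parse_page1]; meta just carries a reference to it. Here is a plain-Python sketch of that mechanism (a hypothetical illustration with dicts, not Scrapy's actual internals):

```python
# meta stores a reference to the item, so the later callback receives
# the same object that parse_page1 created, not a copy.
item = {'main_url': 'http://www.example.com'}   # created in parse_page1
meta = {'item': item}                           # like request.meta['item'] = item
received = meta['item']                         # like item = response.meta['item']
received['other_url'] = 'http://www.example.com/some_page.html'

assert received is item                         # same object in memory
assert item['other_url'] == 'http://www.example.com/some_page.html'
```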

So the idea here is that your callback methods should return (preferably yield) a Request or an Item; scrapy will output the Item and continue crawling with the Request.

