How to fetch the Response object of a Request synchronously on Scrapy?


Question

I believe using the "callback" method is asynchronous; please correct me if I'm wrong. I'm still new to Python, so please bear with me.

Anyway, I'm trying to write a method that checks whether a file exists, and here is my code:

def file_exists(self, url):
    res = False
    # this doesn't work: Request() returns a Request, not a Response,
    # so there is no .status attribute to check at this point
    response = Request(url, method='HEAD', dont_filter=True)
    if response.status == 200:
        res = True
    return res

I thought the Request() method would return a Response object, but it still returns a Request object. To capture the Response, I have to create a separate method as the callback.

Is there a way to get the Response object within the same code block where you create the Request()?

Solution

Request objects don't generate anything by themselves.

Scrapy uses an asynchronous downloader engine, which takes these Request objects and generates Response objects.

If any method in your spider returns a Request object, it is automatically scheduled with the downloader, and the resulting Response object is passed to the specified callback (e.g. Request(url, callback=self.my_callback)). See Scrapy's architecture overview for more detail.
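For concreteness, here is a minimal sketch of that callback pattern applied to the file-existence check from the question. The spider name, URL, and check_exists callback are illustrative, not from the original code:

import scrapy

class FileCheckSpider(scrapy.Spider):
    name = 'file_check'  # hypothetical name
    # let non-2xx responses (e.g. 404) reach the callback instead of
    # being dropped by HttpErrorMiddleware
    handle_httpstatus_list = [404]
    start_urls = ['http://example.com/some/file.pdf']

    def start_requests(self):
        for url in self.start_urls:
            # the downloader fetches this Request and delivers the
            # resulting Response to check_exists asynchronously
            yield scrapy.Request(url, method='HEAD',
                                 callback=self.check_exists,
                                 dont_filter=True)

    def check_exists(self, response):
        # response.status only exists here, inside the callback
        yield {'url': response.url, 'exists': response.status == 200}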

Now, depending on when and where you are doing this, you can schedule requests yourself by telling the engine directly:

self.crawler.engine.schedule(Request(url, callback=self.my_callback), spider) 

If you run this from within a spider, the spider argument here can most likely be self, and self.crawler is an attribute available on every scrapy.Spider.
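As a sketch only: the engine's scheduling API is internal and has changed across Scrapy releases, so schedule() may not exist under this name in your version, and my_callback is an assumed method name:

import scrapy
from scrapy import Request

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def parse(self, response):
        extra_url = response.urljoin('/some/other/page')
        # self is the spider, so it doubles as the spider argument;
        # engine.schedule() is internal API, not a stable contract
        self.crawler.engine.schedule(
            Request(extra_url, callback=self.my_callback), self)

    def my_callback(self, response):
        self.logger.info('got %s (%s)', response.url, response.status)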

Alternatively, you can always block the asynchronous stack by using a synchronous HTTP library such as requests:

import requests

def parse(self, response):
    item = {}
    # note: image URLs live in the src attribute, not href
    image_url = response.xpath('//img/@src').extract_first()
    if image_url:
        # blocking call: the whole crawler waits while this runs
        image_head = requests.head(image_url)
        if 'image' in image_head.headers.get('Content-Type', ''):
            item['image'] = image_url
    return item

This will slow your spider down, since every blocking call stalls Scrapy's event loop, but it's significantly easier to implement and manage.

