Scrapy: how to catch download error and try download it again


Problem Description


During my crawl, some pages fail due to an unexpected redirection and no response is returned. How can I catch this kind of error and re-schedule a request with the original url, not the redirected one?

Before asking here, I searched a lot with Google. It looks like there are two ways to fix this issue: one is to catch the exception in a downloader middleware, the other is to handle the download exception in the errback of the spider's request. I have some questions about both approaches.

  • For method 1, I don't know how to pass the original url to the process_exception function. Below is the example code I have tried (an illustrative sketch of one possible approach appears after the method 2 example further down).

from scrapy import log


class ProxyMiddleware(object):

    def process_request(self, request, spider):
        request.meta['proxy'] = "http://192.168.10.10"
        log.msg('>>>> Proxy %s' % (request.meta['proxy'] if request.meta['proxy'] else ""), level=log.DEBUG)

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('proxy')
        log.msg('Failed to request url %s with proxy %s with exception %s'
                % (request.url, proxy if proxy else 'nil', str(exception)),
                level=log.DEBUG)
        # retry again: returning the request re-schedules it for download
        return request

  • For method 2, I don't know how to pass an external parameter to the errback function in the spider, and I don't know how to retrieve the original url from that errback in order to re-schedule the request.

    Below is the example I tried with method 2:

from scrapy.http import Request
from scrapy.spider import Spider


class ProxytestSpider(Spider):

    name = "proxytest"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def make_requests_from_url(self, url):
        # attach an errback so download failures reach download_errback
        request = Request(url, dont_filter=True, callback=self.parse,
                          errback=self.download_errback)
        print "make requests"
        return request

    def parse(self, response):
        print "in parse function"

    def download_errback(self, e):
        # e is a twisted Failure describing why the download failed
        print type(e), repr(e)
        print repr(e.value)
        print "in downloaderror_callback"

Any suggestions on this recrawl issue are highly appreciated. Thanks in advance.

Regards

Bing

Solution

You could pass a lambda as an errback:

request = Request(url, dont_filter=True, callback=self.parse,
                  errback=lambda x: self.download_errback(x, url))

That way you'll have access to the url inside the errback function:

def download_errback(self, e, url):
    print url
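
Building on that, the errback can go a step further and re-issue the failed request itself. This is a minimal sketch on top of the answer above, assuming that requests returned from an errback are scheduled the same way as callback output:

from scrapy.http import Request


def download_errback(self, failure, url):
    print('download failed for %s: %r' % (url, failure.value))
    # Re-schedule the original url. dont_filter=True keeps the dupe filter
    # from dropping the retry; in practice you would also want a counter
    # (e.g. carried in request.meta) so a dead url is not retried forever.
    return Request(url, dont_filter=True, callback=self.parse,
                   errback=lambda f: self.download_errback(f, url))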
