How to process all kinds of exceptions in a scrapy project, in errback and callback?


Question

I am currently working on a scraper project where it is very important that EVERY request gets properly handled, i.e. that either an error is logged or a successful result is saved. I've already implemented the basic spider, and I can now process 99% of the requests successfully, but I can still get errors like captchas, 50x or 30x responses, or even not enough fields in the result (in which case I'll try another website to find the missing fields).

At first, I thought it would be more "logical" to raise exceptions in the parsing callback and process them all in errback; this could make the code more readable. But I tried, only to find out that errback can only trap errors from the downloader module, such as non-200 response statuses. If I raise a self-implemented ParseError in the callback, the spider just raises it and stops.
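To be concrete about what I mean, here is an untested sketch of how I understand the split (the URL is made up, ParseError is my own class, and the import paths are for the Scrapy 0.x versions I'm using; errback only ever sees downloader-level failures such as timeouts or DNS errors, while exceptions I raise inside the callback never reach it):

from twisted.internet.error import TimeoutError, DNSLookupError

from scrapy.http import Request
from scrapy.spider import BaseSpider  # scrapy.Spider in newer versions

from scraper.myexceptions import ParseError  # my own exception class


class MySpider(BaseSpider):
    name = "example"

    def start_requests(self):
        # errback is only called for failures that happen while the
        # request is being downloaded (timeouts, DNS errors, ...)
        yield Request("http://www.example.com/item/1",
                      callback=self.parseRound1,
                      errback=self.errHandler)

    def parseRound1(self, response):
        # raising my own exception HERE never reaches errHandler;
        # the spider just raises it and stops
        raise ParseError("not enough fields")

    def errHandler(self, failure):
        # only downloader-level failures arrive here
        if failure.check(TimeoutError, DNSLookupError):
            self.log("download error: %s" % failure.getErrorMessage())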

Even if I have to handle parsing failures directly in the callback, I don't know how to retry the request immediately from the callback in a clean fashion. You know, I may have to include a different proxy to send another request, or modify some request headers.
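What I would like is to do something along these lines inside the callback (again just a sketch; looks_like_captcha and extract_items are made-up helpers, and the proxy address is a placeholder):

def parseRound1(self, response):
    if looks_like_captcha(response):  # made-up helper
        # re-issue the same request through a different proxy;
        # dont_filter=True so the dupe filter does not drop it
        retry = response.request.replace(dont_filter=True)
        retry.meta['proxy'] = 'http://some.other.proxy:8080'
        retry.headers['User-Agent'] = 'some other user agent'
        return [retry]
    # otherwise parse normally and return the scraped items
    return extract_items(response)  # made-up helper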

I admit I'm relatively new to scrapy, but I've tried back and forth for days and still cannot get this working... I've checked every single question on SO and none of them matches. Thanks in advance for the help.

UPDATE: I realize this could be a very complex question, so I'll try to illustrate the scenario in the following pseudo code; I hope this helps:

from scraper.myexceptions import *

def parseRound1(self, response):

    .... some parsing routines ...
    if something went wrong:
       # this makes the spider raise the exception and stop
       raise CaptchaError
    ...

    if not enough fields were scraped:
       raise ParseError(task, "not enough fields")
    else:
       return items

def parseRound2(self, response):
    ...some other parsing routines...

def errHandler(self, failure):
    # how to trap all the exceptions?
    r = failure.trap()
    # cannot trap ParseError here
    if r == CaptchaError:
       # how to enqueue the original request here?
       retry
    elif r == ParseError:
        if raised from parseRound1:
            new request for Round2
        else:
            some other retry mechanism
    elif r == HTTPError:
       ignore or retry

Answer

EDIT 16 Nov 2012: Scrapy >= 0.16 uses a different method to attach methods to signals; an extra example has been added.

The simplest solution would be to write an extension in which you capture failures, using Scrapy signals. For example, the following extension will catch all errors and print a traceback.

You could do anything with the Failure - which itself is an instance of twisted.python.failure.Failure - like save it to your database, or send an email.

For Scrapy versions up to 0.16:

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class FailLogger(object):
  def __init__(self):
    """ 
    Attach appropriate handlers to the signals
    """
    dispatcher.connect(self.spider_error, signal=signals.spider_error)

  def spider_error(self, failure, response, spider):
    print "Error on {0}, traceback: {1}".format(response.url, failure.getTraceback())

For Scrapy versions 0.16 and up:

from scrapy import signals

class FailLogger(object):

  @classmethod
  def from_crawler(cls, crawler):
    ext = cls()

    crawler.signals.connect(ext.spider_error, signal=signals.spider_error)

    return ext

  def spider_error(self, failure, response, spider):
    print "Error on {0}, traceback: {1}".format(response.url, failure.getTraceback())  

You would enable the extension in the settings, with something like:

EXTENSIONS = {
    'spiders.extensions.faillog.FailLogger': 599,
}
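Since the failure handed to the signal is just a Twisted Failure, you can also branch on the exception type inside the handler and, for instance, re-schedule the failed request. Below is a rough sketch in the same style as the extension above (CaptchaError and ParseError are assumed to be your own exception classes, and the engine.crawl call is the part most likely to differ between Scrapy versions):

from scrapy import signals

from scraper.myexceptions import CaptchaError, ParseError  # your own classes


class FailHandler(object):

  def __init__(self, crawler):
    self.crawler = crawler

  @classmethod
  def from_crawler(cls, crawler):
    ext = cls(crawler)
    crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
    return ext

  def spider_error(self, failure, response, spider):
    # failure.check(SomeError) returns the matching class or None,
    # failure.value is the exception instance itself
    if failure.check(CaptchaError):
      # re-schedule the original request; dont_filter=True keeps the
      # dupe filter from dropping the repeated request
      retry = response.request.replace(dont_filter=True)
      self.crawler.engine.crawl(retry, spider)
    elif failure.check(ParseError):
      print "parse error on %s: %s" % (response.url, failure.value)
    else:
      print failure.getTraceback()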

