Scrapy: how to debug scrapy lost requests


Problem description

I have a Scrapy spider, but sometimes it doesn't return requests.

I found this by adding log messages before yielding a request and after getting a response.

The spider iterates over pages and, on each page, parses the links for item scraping.

Here is part of the code:

class SampleSpider(BaseSpider):
    ....
    def parse_page(self, response):
        ...
        request = Request(target_link, callback=self.parse_item_general)
        request.meta['date_updated'] = date_updated
        self.log('parse_item_general_send {url}'.format(url=request.url), level=log.INFO)
        yield request

    def parse_item_general(self, response):
        self.log('parse_item_general_recv {url}'.format(url=response.url), level=log.INFO)
        sel = Selector(response)
        ...

I compared the counts of each log message, and "parse_item_general_send" appears more often than "parse_item_general_recv".

There are no 400 or 500 errors in the final statistics; all response status codes are 200. It looks like the requests just disappear.
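The send/recv comparison described above can be done mechanically by counting the log markers. A minimal sketch (the log contents below are a made-up example; in practice you would read your own crawl log file):

```python
import re

# Tally the custom "send"/"recv" markers emitted by the spider's log calls.
# The log text here is invented for illustration only.
log_text = """\
parse_item_general_send http://example.com/item/1
parse_item_general_recv http://example.com/item/1
parse_item_general_send http://example.com/item/2
"""

sent = len(re.findall(r'parse_item_general_send', log_text))
received = len(re.findall(r'parse_item_general_recv', log_text))

# Requests that were yielded but never produced a parsed response:
print(sent - received)  # → 1
```

Any positive difference means some yielded requests never reached their callback.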

I also added these settings to minimize possible errors:

CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 0.8

Because of the asynchronous nature of Twisted, I don't know how to debug this bug. I found a similar question: Python Scrapy not always downloading data from website, but it has no answers.

Recommended answer

On the same note as Rho, you can add the setting

DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter' 

to your settings.py, which will remove the URL caching (duplicate-request filtering). This is a tricky issue, since there is no debug string in the Scrapy logs telling you when it uses a cached result.
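This is the likely cause of the vanishing requests: the default dupefilter silently drops any request whose fingerprint it has already seen. A simplified, self-contained model of that behavior (not Scrapy's actual implementation, which fingerprints method, URL, and body together):

```python
from hashlib import sha1

class SimpleDupeFilter:
    """Toy model of Scrapy's request dupefilter: silently drops repeated URLs."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        fp = sha1(url.encode()).hexdigest()
        if fp in self.seen:
            return True   # duplicate -> request is dropped, with no log line
        self.seen.add(fp)
        return False      # first sighting -> request is scheduled

f = SimpleDupeFilter()
print(f.request_seen("http://example.com/item/1"))  # → False (scheduled)
print(f.request_seen("http://example.com/item/1"))  # → True (dropped silently)
```

Instead of disabling the filter globally, you can also bypass it for individual requests with `Request(url, callback=..., dont_filter=True)`, which keeps deduplication for everything else.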

