Scrapy: how to debug scrapy lost requests
Question
I have a scrapy spider, but it sometimes fails to return requests.
I discovered this by adding log messages before yielding each request and after receiving its response.
The spider iterates over pages and, on each page, parses the links for item scraping.
Here is part of the code:
class SampleSpider(BaseSpider):
    ...

    def parse_page(self, response):
        ...
        request = Request(target_link, callback=self.parse_item_general)
        request.meta['date_updated'] = date_updated
        self.log('parse_item_general_send {url}'.format(url=request.url),
                 level=log.INFO)
        yield request

    def parse_item_general(self, response):
        self.log('parse_item_general_recv {url}'.format(url=response.url),
                 level=log.INFO)
        sel = Selector(response)
        ...
I compared the counts of the two log messages, and "parse_item_general_send" appears more often than "parse_item_general_recv".
There are no 400 or 500 errors in the final statistics; every response status code is 200. It looks like the requests just disappear.
I've also added these settings to minimize possible errors:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 0.8
Because of the asynchronous nature of Twisted, I don't know how to debug this. I found a similar question, Python Scrapy not always downloading data from website, but it has no answers.
Answer
On the same note as Rho, you can add the setting
DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'
to your settings.py, which disables duplicate URL filtering. This is a tricky issue, because there is no debug string in the scrapy logs telling you when a request has been dropped as a duplicate.
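The reason this setting helps: Scrapy's default dupefilter computes a fingerprint for each request and silently discards any request whose fingerprint has already been seen, so its callback simply never fires. A rough illustrative sketch of that behavior (`ToyDupeFilter` is a simplified stand-in, not Scrapy's actual `RFPDupeFilter`, which fingerprints the method and body as well as the URL):

```python
import hashlib

class ToyDupeFilter:
    """Simplified sketch of duplicate-request filtering: the second
    request for an already-seen URL is dropped with no log output."""
    def __init__(self):
        self.seen = set()

    def should_drop(self, url):
        fp = hashlib.sha1(url.encode()).hexdigest()
        if fp in self.seen:
            return True   # dropped silently: the callback never runs
        self.seen.add(fp)
        return False

f = ToyDupeFilter()
print(f.should_drop('http://example.com/item/1'))  # False: scheduled
print(f.should_drop('http://example.com/item/1'))  # True: filtered out
```

If only specific requests should bypass the filter, passing `dont_filter=True` to those `Request` objects is a more targeted fix than disabling the dupefilter globally.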