Scrapy CrawlSpider retry scrape


Problem Description


For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some javascript that autoreloads until it gets the real page. I can detect when this happens and I want to retry downloading and scraping the page. The logic that I use in my CrawlSpider is something like:

def parse_page(self, response):
    url = response.url

    # Check to make sure the page is loaded
    if 'var PageIsLoaded = false;' in response.body:
        self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
        yield Request(url, self.parse, dont_filter=True)
        return

    ...
    # Normal parsing logic


However, it seems like when the retry logic gets called and a new Request is issued, the pages and the links they contain don't get crawled or scraped. My thought was that by using self.parse which the CrawlSpider uses to apply the crawl rules and dont_filter=True, I could avoid the duplicate filter. However with DUPEFILTER_DEBUG = True, I can see that the retry requests get filtered away.
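
For reference, the duplicate-filter logging mentioned here is controlled by an ordinary Scrapy setting; a minimal sketch of the relevant line in settings.py (nothing project-specific is assumed):

# settings.py (sketch) -- log every request the dupefilter drops,
# instead of only the first duplicate, so filtered retries are visible
DUPEFILTER_DEBUG = True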


Am I missing something, or is there a better way to handle this? I'd like to avoid the complication of doing dynamic js rendering using something like splash if possible, and this only happens intermittently.

Recommended Answer

I would consider having a custom retry middleware instead - similar to the built-in RetryMiddleware.

Example implementation (not tested):

import logging

logger = logging.getLogger(__name__)


class RetryMiddleware(object):
    def process_response(self, request, response, spider):
        # If the "placeholder" page came back, retry the request;
        # otherwise pass the response through unchanged.
        # (On Python 3, response.body is bytes, so compare against a
        # bytes literal or use response.text instead.)
        if 'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of {}'.format(response.url))
            return self._retry(request) or response

        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})

        # Re-issue a copy of the original request, marked so the
        # dupefilter does not drop it as already seen.
        retryreq = request.copy()
        retryreq.dont_filter = True
        return retryreq

And don't forget to activate it.
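
As a side note, a custom downloader middleware is activated through the DOWNLOADER_MIDDLEWARES setting; a minimal sketch, assuming the class above lives in a hypothetical myproject/middlewares.py module (the module path and the order value 550 are placeholders):

# settings.py (sketch) -- register the custom middleware so Scrapy
# calls its process_response(); the module path is an assumption
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RetryMiddleware': 550,
}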

