Scrapy retry or redirect middleware

Views: 36 Published: 2021/6/26 18:53:57 Tags: python redirect python-2.7 scrapy

Problem description

While crawling a site with Scrapy, I get redirected to a user-blocked page about 1/5th of the time. When that happens, I lose the pages I was redirected from. I don't know which middleware to use or what settings to use in that middleware, but I want this:

DEBUG: Redirecting (302) to (GET http://domain.com/foo.aspx) from (GET http://domain.com/bar.htm)

to NOT drop bar.htm. I end up with no data from bar.htm when the scraper is done, but I'm rotating proxies, so if it tries bar.htm again (maybe a few more times), I should get it. How do I set the number of tries for that?

If it matters, I'm only allowing the crawler to use a very specific starting URL and then only follow "next page" links, so it should go in order through a small number of pages. That is why I need it to either retry, e.g., page 34, or come back to it later. The Scrapy documentation says it should retry 20 times by default, but I don't see it retrying at all. Also, if it helps: all redirects go to the same page (a "go away" page, the foo.aspx above). Is there a way to tell Scrapy that this particular page "doesn't count", and that if it's getting redirected there, it should keep retrying? I saw something in the downloader middleware referring to particular HTTP codes in a list. Can I add 302 to the "always keep trying this" list somehow?

Recommended answer

I had the same problem today with a website that used 301..303 redirects, and sometimes also a meta redirect. I built a retry middleware and used some chunks from the redirect middlewares:

from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_meta_refresh
from scrapy import log


class CustomRetryMiddleware(RetryMiddleware):
    """Retry the original request instead of following a redirect."""

    def process_response(self, request, response, spider):
        url = response.url
        # HTTP redirects: 302 and 303 are listed as well, to match the
        # 301..303 redirects mentioned in the answer text.
        if response.status in [301, 302, 303, 307]:
            log.msg("trying to redirect us: %s" % url, level=log.INFO)
            reason = 'redirect %d' % response.status
            # _retry() returns None once the retry limit is reached;
            # in that case the response is passed through unchanged.
            return self._retry(request, reason, spider) or response
        # handle meta redirect
        interval, redirect_url = get_meta_refresh(response)
        if redirect_url:
            log.msg("trying to redirect us: %s" % url, level=log.INFO)
            reason = 'meta'
            return self._retry(request, reason, spider) or response
        # test for captcha page
        hxs = HtmlXPathSelector(response)
        captcha = hxs.select(".//input[contains(@id, 'captchacharacters')]").extract()
        if captcha:
            log.msg("captcha page %s" % url, level=log.INFO)
            reason = 'captcha'
            return self._retry(request, reason, spider) or response
        # nothing suspicious: hand the response on as usual
        return response
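The three checks in the middleware (redirect status, meta refresh, captcha input) can also be read as a single decision function. The following is a minimal sketch of that logic using only the standard library, so the rules can be tried without a running crawl. The names retry_reason, REDIRECT_STATUSES, META_REFRESH_RE and CAPTCHA_RE are hypothetical and not part of Scrapy, the regular expressions are rough stand-ins for get_meta_refresh and the XPath query, and including 302 and 303 in the status set is an assumption based on the 301..303 range mentioned in the answer.

```python
import re

# Assumption: statuses treated as "we were bounced, retry the original request".
REDIRECT_STATUSES = frozenset({301, 302, 303, 307})

# Rough stand-in for scrapy.utils.response.get_meta_refresh: detects a
# <meta http-equiv="refresh" ...> tag anywhere in the body.
META_REFRESH_RE = re.compile(
    r"<meta[^>]+http-equiv\s*=\s*[\"']?refresh[\"']?", re.IGNORECASE
)

# Rough stand-in for the XPath check: an <input> whose id contains
# the text "captchacharacters".
CAPTCHA_RE = re.compile(
    r"<input[^>]+id\s*=\s*[\"'][^\"']*captchacharacters", re.IGNORECASE
)


def retry_reason(status, body):
    """Return a short reason string if the response should be retried, else None."""
    if status in REDIRECT_STATUSES:
        return "redirect %d" % status
    if META_REFRESH_RE.search(body):
        return "meta"
    if CAPTCHA_RE.search(body):
        return "captcha"
    return None


if __name__ == "__main__":
    print(retry_reason(302, ""))
    print(retry_reason(200, "<html><body>ok</body></html>"))
```

In the middleware itself, a non-None reason corresponds to the branches that call self._retry(request, reason, spider), and None corresponds to the final return response.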

To use this middleware, it's probably best to disable the existing redirect middlewares for this project in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'YOUR_PROJECT.scraper.middlewares.CustomRetryMiddleware': 120,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': None,
}
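On the question of how to set the number of tries: because CustomRetryMiddleware subclasses RetryMiddleware, it should pick up the standard RETRY_TIMES setting. That setting defaults to 2; the figure of 20 quoted in the question matches the default of REDIRECT_MAX_TIMES, which limits redirects rather than retries. Below is a sketch of the same settings.py for Scrapy 1.0 or later, where the scrapy.contrib.downloadermiddleware package was renamed to scrapy.downloadermiddlewares. The value 5 is an arbitrary assumption, and on those versions the middleware's own import of RetryMiddleware would need the same rename.

```python
# settings.py (sketch, assuming Scrapy 1.0 or later)

# Number of retries per request on top of the first attempt.
# Assumption: 5 is an illustrative value; Scrapy's default is 2.
RETRY_TIMES = 5

DOWNLOADER_MIDDLEWARES = {
    # Project path taken from the answer; adjust to the real package name.
    'YOUR_PROJECT.scraper.middlewares.CustomRetryMiddleware': 120,
    # Disable the built-in redirect handling so the custom middleware
    # gets to see 3xx and meta-refresh responses.
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': None,
}
```

With rotating proxies, a higher RETRY_TIMES gives each blocked page more chances to be fetched through a different proxy before Scrapy gives up on it.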
