Can't get Scrapy to parse and follow 301, 302 redirects


Problem description

I'm trying to write a very simple website crawler to list URLs along with referrer and status codes for 200, 301, 302 and 404 http status codes.

Turns out that Scrapy works great and my script uses it correctly to crawl the website and can list urls with 200 and 404 status codes without problems.

The problem is: I can't find how to have scrapy follow redirects AND parse/output them. I can get one to work but not both.

What I've tried so far:

• Setting meta={'dont_redirect':True} and setting REDIRECTS_ENABLED = False (see the sketch after this list)
• Adding 301, 302 to handle_httpstatus_list
• Changing settings specified in the redirect middleware doc
• Reading the redirect middleware code for insight
• Various combinations of all of the above
• Other random stuff
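
For reference, here is a minimal sketch of how those knobs are usually wired up in a spider (the spider name is made up and the URL is just httpbin for illustration): REDIRECT_ENABLED is the Scrapy setting that toggles RedirectMiddleware, and handle_httpstatus_list lets non-200 responses reach the callback. Configured this way, the 3xx responses do arrive in parse, but Scrapy no longer follows them, which is exactly the "one but not both" behaviour described above.

import scrapy


class AttemptsSpider(scrapy.Spider):
    # hypothetical spider, only to illustrate the attempted settings
    name = "attempts"
    start_urls = (
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )

    # let 3xx/404 responses reach the callback instead of being filtered out
    handle_httpstatus_list = [301, 302, 404]

    # disable RedirectMiddleware for the whole spider
    custom_settings = {'REDIRECT_ENABLED': False}

    def parse(self, response):
        # with redirects disabled, the 302 shows up here but is never followed
        self.logger.info("got %d for %r" % (response.status, response.url))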

Here's the public repo if you want to take a look at the code.

Recommended answer

If you want to parse 301 and 302 responses, and follow them at the same time, ask for 301 and 302 to be processed by your callback and mimic the behavior of RedirectMiddleware.

Let's illustrate with a simple spider to start with (not working as you intend yet):

import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    def parse(self, response):
        self.logger.info("got response for %r" % response.url)

Right now, the spider asks for 2 pages, and the 2nd one should redirect to http://example.com/:

$ scrapy runspider test.py
2016-09-30 11:28:17 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:28:18 [scrapy] DEBUG: Redirecting (302) to <GET http://example.com/> from <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F>
2016-09-30 11:28:18 [handle] INFO: got response for 'https://httpbin.org/get'
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2016-09-30 11:28:18 [handle] INFO: got response for 'http://example.com/'
2016-09-30 11:28:18 [scrapy] INFO: Spider closed (finished)

The 302 is handled by RedirectMiddleware automatically and it does not get passed to your callback.
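
As a side note, the same thing can be controlled per request instead of per spider: Request.meta supports a dont_redirect key (which makes RedirectMiddleware skip that request) and a handle_httpstatus_list key (which lets the 3xx response past HttpErrorMiddleware). A minimal sketch, with a made-up spider name:

import scrapy


class RawRedirectSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = "raw_redirect"

    def start_requests(self):
        yield scrapy.Request(
            'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
            meta={
                # skip RedirectMiddleware for this request only
                'dont_redirect': True,
                # let the 3xx response reach the callback
                'handle_httpstatus_list': [301, 302],
            },
        )

    def parse(self, response):
        # the 302 arrives here instead of being followed automatically
        self.logger.info("got response %d for %r" % (response.status, response.url))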

Let's configure the spider to handle 301 and 302s in the callback, using handle_httpstatus_list:

import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]
    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))

Let's run it:

$ scrapy runspider test.py
2016-09-30 11:33:32 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:33:33 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:33:33 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:33:33 [scrapy] INFO: Spider closed (finished)

Here, we're missing the redirection.

Do the same as RedirectMiddleware but in the spider callback:

from six.moves.urllib.parse import urljoin

import scrapy
from scrapy.utils.python import to_native_str


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]
    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))

        # do something with the response here...

        # handle redirection
        # this is copied/adapted from RedirectMiddleware
        if response.status >= 300 and response.status < 400:

            # HTTP header is ascii or latin1, redirected url will be percent-encoded utf-8
            location = to_native_str(response.headers['location'].decode('latin1'))

            # get the original request
            request = response.request
            # and the URL we got redirected to
            redirected_url = urljoin(request.url, location)

            if response.status in (301, 307) or request.method == 'HEAD':
                redirected = request.replace(url=redirected_url)
                yield redirected
            else:
                redirected = request.replace(url=redirected_url, method='GET', body='')
                redirected.headers.pop('Content-Type', None)
                redirected.headers.pop('Content-Length', None)
                yield redirected

Running the spider again:

$ scrapy runspider test.py
2016-09-30 11:45:20 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:45:21 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F)
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'http://example.com/'
2016-09-30 11:45:21 [scrapy] INFO: Spider closed (finished)

We got redirected to http://example.com/ and we also got the response through our callback.
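
To tie this back to the original goal of listing URLs with referrer and status codes, the same callback can also yield a plain dict item for every response it sees. Below is a minimal sketch built on the spider above; the spider name is made up, 404 is added to handle_httpstatus_list to match the question, and the redirect handling is reduced to the GET-only case for brevity:

from six.moves.urllib.parse import urljoin

import scrapy
from scrapy.utils.python import to_native_str


class StatusReportSpider(scrapy.Spider):
    # hypothetical spider combining the redirect handling above
    # with the URL/referrer/status output the question asks for
    name = "status_report"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302, 404]

    def parse(self, response):
        referer = response.request.headers.get('Referer')
        # one record per response: URL, referrer and HTTP status code
        yield {
            'url': response.url,
            'status': response.status,
            'referer': referer.decode('latin1') if referer else None,
        }

        # follow redirects manually, as in the spider above (GET-only for brevity)
        if 300 <= response.status < 400:
            location = to_native_str(response.headers['location'].decode('latin1'))
            yield response.request.replace(url=urljoin(response.url, location),
                                           method='GET', body='')

Running it with something like scrapy runspider test.py -o report.csv should then write one row per crawled URL.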
