Callback for redirected requests in Scrapy


Problem Description

I am trying to scrape using the Scrapy framework. Some requests are redirected, but the callback function set in start_requests is not called for these redirected URL requests, though it works fine for the non-redirected ones.

I have the following code in the start_requests function:

for user in users:
    yield scrapy.Request(url=userBaseUrl + str(user['userId']), cookies=cookies,
                         headers=headers, dont_filter=True, callback=self.parse_p)

But this self.parse_p is called only for the non-302 requests.

Recommended Answer

I guess you get a callback for the final page (after the redirect). Redirects are taken care of by the RedirectMiddleware. You could disable it, but then you would have to handle all redirects manually. If you want to selectively disable redirects for a few types of requests, you can do it like this:

request = scrapy.Request(url, meta={'dont_redirect': True}, callback=self.manual_handle_of_redirects)
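The answer's mention of disabling RedirectMiddleware entirely can also be done project-wide through Scrapy's settings rather than per request. A minimal sketch of a settings.py fragment using Scrapy's documented REDIRECT_ENABLED switch:

```python
# settings.py -- disable redirect handling for the whole project.
# With this off, 30x responses are passed straight to your callbacks
# (provided their status codes are allowed, e.g. via a spider's
# handle_httpstatus_list), and you must follow Location headers yourself.
REDIRECT_ENABLED = False
```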

I'm not sure that the intermediate requests/responses are very interesting, though. That's also what RedirectMiddleware assumes. As a result, it performs the redirects automatically and saves the intermediate URLs (the only interesting thing) in:

response.request.meta.get('redirect_urls')

You have several options!

Example spider:

import scrapy

class DimSpider(scrapy.Spider):
    name = "dim"

    start_urls = (
        'http://example.com/',
    )

    def parse(self, response):
        yield scrapy.Request(url="http://example.com/redirect302.php", dont_filter=True, callback=self.parse_p)

    def parse_p(self, response):
        print(response.request.meta.get('redirect_urls'))
        print("done!")

Example output...

DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
DEBUG: Redirecting (302) to <GET http://myredirect.com> from <GET http://example.com/redirect302.php>
DEBUG: Crawled (200) <GET http://myredirect.com/> (referer: http://example.com/redirect302.com/)
['http://example.com/redirect302.php']
done!

If you really want to scrape the 302 pages, you have to explicitly allow it. For example, here I allow 302 and set dont_redirect to True:

handle_httpstatus_list = [302]
def parse(self, response):
    r = scrapy.Request(url="http://example.com/redirect302.php", dont_filter=True, callback=self.parse_p)
    r.meta['dont_redirect'] = True
    yield r

The end result is:

DEBUG: Crawled (200) <GET http://www.example.com/> (referer: None)
DEBUG: Crawled (302) <GET http://example.com/redirect302.com/> (referer: http://www.example.com/)
None
done!

This spider should manually follow 302 URLs:

import scrapy

class DimSpider(scrapy.Spider):
    name = "dim"

    handle_httpstatus_list = [302]

    def start_requests(self):
        yield scrapy.Request("http://page_with_or_without_redirect.html",
                             callback=self.parse200_or_302, meta={'dont_redirect':True})

    def parse200_or_302(self, response):
        print("I'm on: %s with status %d" % (response.url, response.status))
        if 'Location' in response.headers:
            print("redirecting")
            # Scrapy header values are bytes; decode to get a str URL
            return [scrapy.Request(response.headers['Location'].decode(),
                                   callback=self.parse200_or_302,
                                   meta={'dont_redirect': True})]

Be careful: don't omit setting handle_httpstatus_list = [302], otherwise you will get "HTTP status code is not handled or not allowed".

